UC Irvine Electronic Theses and Dissertations

Title Development and Application of Computational Approaches in Drug Discovery

Permalink https://escholarship.org/uc/item/00m5c6qs

Author Zanette, Camila

Publication Date 2019

License https://creativecommons.org/licenses/by/4.0/

Peer reviewed|Thesis/dissertation

eScholarship.org Powered by the California Digital Library, University of California

UNIVERSITY OF CALIFORNIA, IRVINE

Development and Application of Computational Approaches in Drug Discovery

DISSERTATION

submitted in partial satisfaction of the requirements for the degree of

DOCTOR OF PHILOSOPHY

in Pharmacological Science

by

Camila Zanette

Dissertation Committee:
Professor David L. Mobley, Chair
Professor Andrej Luptak
Professor Celia W. Goulding

2019

© 2019 Camila Zanette

DEDICATION

For Colin

TABLE OF CONTENTS

Page

LIST OF FIGURES vi

LIST OF TABLES xiv

LIST OF ALGORITHMS xvi

ACKNOWLEDGMENTS xvii

CURRICULUM VITAE xviii

ABSTRACT OF THE DISSERTATION xxi

1 Introduction 1

2 Toward learned chemical perception of force field typing rules 4
2.1 Abstract ...... 4
2.2 Introduction ...... 5
2.3 Methods ...... 8
2.3.1 Monte Carlo sampling over chemical perception trees ...... 13
2.3.2 SMARTY is an algorithm for learning indirect chemical perception trees ...... 19
2.3.3 SMIRKY is an algorithm for learning direct chemical perception trees ...... 22
2.3.4 SMARTY and SMIRKY were evaluated using multiple molecule sets compared to reference force fields ...... 28
2.4 Results and Discussion ...... 32
2.4.1 SMARTY ...... 32
2.4.2 SMIRKY ...... 43
2.5 Conclusions ...... 54

3 Challenging molecular test cases for implicit solvation 57
3.1 Abstract ...... 57
3.2 Introduction ...... 58
3.3 Methods ...... 60
3.3.1 FreeSolv Database ...... 60
3.3.2 Implicit Hydration Free Energy Calculation using GROMACS ...... 61
3.3.3 We separate the molecules with positive electrostatic term ...... 62
3.3.4 Reprocessing of Implicit Solvent Hydration Free Energy using SEA - Hybrid model ...... 63
3.3.5 1,4-dimethylpiperazine study case ...... 64
3.4 Results and discussion ...... 66
3.4.1 Our hybrid method improved the overall accuracy of the hydration free energies ...... 67
3.4.2 A closer look at the 1,4-dimethylpiperazine implicit solvent model result ...... 69
3.4.3 Comparing series of molecule types ...... 74
3.5 Conclusions ...... 77
3.6 Acknowledgement ...... 78

4 Predicting conformational poses of the lissoclimides for translation inhibition using in silico and in cerebro analysis 79
4.1 Abstract ...... 79
4.2 Introduction ...... 80
4.3 Methods ...... 83
4.3.1 We first prepared the structures for docking calculations ...... 84
4.3.2 Different docking methods were tested ...... 84
4.3.3 We calculated the shape-similarity of the ligands ...... 85
4.3.4 The new chlorolissoclimide structure influenced the later part of our study ...... 85
4.3.5 Analysis ...... 86
4.4 Results and Discussion ...... 86
4.4.1 Tanimoto score ...... 87
4.4.2 Structure-Activity Relationship ...... 88
4.4.3 Molecules without IC50 ...... 92
4.4.4 Docking versus Crystal Structure ...... 93
4.5 Conclusion ...... 94

5 Progress toward the comprehensive computational design of a macrocyclic peptide 96
5.1 Abstract ...... 96
5.2 Introduction ...... 97
5.3 Methods ...... 99
5.3.1 Initial structures preparation ...... 100
5.3.2 simulations ...... 101
5.3.3 Analysis ...... 101
5.4 Results and Discussion ...... 101
5.5 Conclusion ...... 108

Bibliography 110

A Appendix - Progress toward the comprehensive computational design of a macrocyclic peptide 129

LIST OF FIGURES

Page

2.1 Describing a specific chemical group with SMARTS. An example of how a particular type of hydroxyl oxygen in a molecule (top) can be described using SMARTS strings. In this case, the SMARTS pattern (bottom) is for a hydroxyl connected to a carbon which is itself connected to another carbon. The alpha atoms (light blue) for the oxygen base type (“[#8]”) are one hydrogen (“[#1]”) and one carbon (“[#6]”), and they are connected to the base type via a single bond (“-”). The carbon (“[#6]”) atom is connected to another carbon (“[#6]”, which is considered a beta atom (pink)) via any bond (“~”). The oxygen has a decorator (light green) (“X2”) which means it has two connected atoms. ...... 11

2.2 Illustration of the parm99/parm@Frosst hydrogen atom types for the AlkEthOH set. This specific set has five different hydrogen types: the OH (highlighted in blue), HC (highlighted in yellow), H1 (highlighted in green), H2 (highlighted in lilac), and H3 (highlighted in pink) atom types. Reference atom type labels for each hydrogen atom are shown in italics next to each atom. SMARTS recovered by SMARTY for these atom types are shown in the corresponding color. ...... 12

2.3 SMARTY and SMIRKY. Workflows for the SMARTY (left) and SMIRKY (right) tools are shown. There are three input data categories for SMARTY and SMIRKY (shown at top). In SMARTY (on the left), there are input files (such as base types, initial atom types, and decorators), molecules (input via a molecule set file), and reference data consisting of typed molecules (parm99/parm@Frosst atom typing was used in this work). SMIRKY has similar inputs, shown in the purple area on the right, and additionally allows the user to set the decorator odds; its reference data are fragment types (here, from smirnoff99Frosst). The algorithms are represented in the green area and are available on GitHub (https://github.com/openforcefield/smarty). Both tools begin by reading and processing the input files (top part of the green area). SMARTY (on the left) then conducts a series of moves (coral area) consisting of choosing one working atom type per iteration (icon with connected atoms) and deleting or modifying (pencil icon) this atom type, with all choices made with equal probability (Section 2.3.2). SMIRKY (on the right) is similar, but samples over working fragment types (such as non-bonded types, bonds, angles, and proper and improper torsions; in the figure these are represented by two different icons with connected atoms) and uses weighted probabilities for its choices (Section 2.3.3). The acceptance criteria (diamond icon) for both tools are the same, using Equation 2.3 (Section 2.3.1). Both tools run a user-specified number of iterations and iterate over the steps in the coral area (repeat icon), then write out a final results file after all iterations are completed. ...... 14

2.4 Illustration of the bipartite graph used to calculate the score for a new proposed move. (a) A bipartite graph is created with a node for each working (light red) and reference (light blue) atom type. (b) Edges (represented by lines) are created for all possible connections between the nodes of the two sets. Initially each edge has a weight of zero. (c) For each atom, the weight is incremented by one on the edge connecting that atom’s working and reference atom types. In this figure, we illustrate the case for potential graph matches to the hydrogen working atom type (“#1”) using the AlkEthOH molecule set (Section 2.3.4). However, in practice this process is applied to all working atom types simultaneously. (d) Each node is then restricted to have only a single edge, such that the total weight is maximized. ...... 16

2.5 Flowchart illustrating the decision tree used for SMARTY move proposals. The numbers attached to certain arrows correspond to the probability of following that arrow; all moves are made with equal probability in SMARTY. The diamonds represent decisions and the rectangles are processes. We start the SMARTY move proposals with the working atom types and decide randomly with equal probability if we want to remove or add an atom type to the set. If we decide to remove, we randomly choose a working type to be removed and re-score the new set. If we decide to add, we pick a working atom type from the set and check if it has an alpha substituent. If the answer is no, we either add an alpha substituent or a decorator. If the answer is yes, we first check if the working atom type describes a hydrogen, and if yes, we add a beta substituent; if not, we either add an alpha or a beta substituent. The new atom type SMARTS created is added to the end of the list of its corresponding parent. The end result (black circle on the bottom) is a move proposal which is then evaluated via the scoring function (Section 2.3.1). ...... 21

2.6 Mapping between ChemicalEnvironment and corresponding SMIRKS pattern. This illustrates a SMIRKS pattern “[#6X3H2,#7X2H1;A+0:1]-[#1:2]” (middle) for a substructure (top) being partitioned and how it would be stored in a ChemicalEnvironment object (bottom). The SMIRKS pattern (middle) represents a bond parameter “-” (gray) between two labeled atoms: a carbon (light blue) or nitrogen (light blue) atom and a hydrogen atom (light red). In this figure, we show two matches for this SMIRKS pattern on the top structure. The matches are highlighted in light blue (atom 1) and light red (atom 2) circles on the substructure (top). In the ChemicalEnvironment representation, at bottom, atom 1 (light blue rectangle) has two ORtypes: “#6X3H2”, with a carbon atom base “#6” and OR decorators “[‘X3’, ‘H2’]”, corresponding to a connectivity of three and a total hydrogen count of two, and “#7X2H1”, with nitrogen atom base “#7” and OR decorators “[‘X2’, ‘H1’]” (light green). The AND decorators for atom 1 are “[‘A’, ‘+0’]” (light green), corresponding to an aliphatic atom with a zero charge. In SMIRKS strings, the low precedence “and” operator is given by a semi-colon “;”, so the bond described by this pattern is one between a hydrogen atom and either a carbon atom with connectivity three and hydrogen count two or a nitrogen atom with connectivity two and hydrogen count one, where that atom is also aliphatic with zero charge. ...... 25

2.7 Flowchart to illustrate the decision tree used for SMIRKY move proposals. The numbers attached to certain arrows correspond to the probability of following that arrow, as currently implemented in SMIRKY. The diamonds represent decisions and the rectangles are processes. We start the SMIRKY move proposals by choosing an initial fragment type SMIRKS from the set. SMIRKY then decides randomly with specified probabilities whether it will remove or add a fragment type SMIRKS. To add, the algorithm chooses between making a change to an atom or a bond. If the change is in a bond, SMIRKY picks OR or AND decorators to change, but if the change is in an atom, SMIRKY can pick OR decorator or OR base to change; it then chooses between deleting, adding or swapping one of those decorators. The last step is to choose if a change to an outer atom should be symmetric; if so, both outer atoms are changed. Also, if SMIRKY chooses to change an atom, it can delete or add a new atom to this atom. The new fragment type is added to the end of its parent’s list. The end result (black circle) is a move proposal which is then evaluated via the scoring algorithm described in Section 2.3.1. * - The probability of “yes” for this particular decision is 0.2, 0.6, and 0.15 for bond, angle, and torsion types respectively. ...... 27

2.8 Total score as a function of iteration for SMARTYorig sampling AlkEthOH. The graph shows how the total score (fraction of atoms matching the parm99/parm@Frosst reference atom types) as a function of iteration varies with temperature for SMARTYorig sampling AlkEthOH. The lines in the graph represent the temperatures 0 (pink), 0.0001 (orange), 0.01 (green) and 0.1 (gray). We can see a high variation throughout the simulation and a predominance of lower scores for extreme temperatures such as 0 and 1.0. At intermediate temperatures, such as 0.01 and 0.0001, SMARTY scored higher. The Supporting Information includes plots for simulations at all temperatures. ...... 34

2.9 Frequency of success for SMARTYorig on the AlkEthOH and PhEthOH sets. Heat maps show the results of SMARTYorig for the AlkEthOH (A) and PhEthOH (B) molecule sets. We plot the fraction of simulation time that each reference atom type (x axis) is matched by a SMARTS pattern with a partial score of 100%. We tested eight different temperatures (y axis) and ran 10 simulations of 10,000 iterations each. Darker colors imply a higher rate of matching 100% of that reference atom type; see color bars at right. If we never found the reference atom type, the corresponding cell would be white, but this does not occur here. ...... 36

2.10 Example of a SMARTY hierarchy and inheritance in AlkEthOH. This figure shows the final hierarchy achieved in one SMARTY simulation with the AlkEthOH set at T = 0.0001 after 10,000 iterations. The top shows the hierarchy of SMARTS strings discovered where each color represents a different level. For example, the initial types are represented in light blue, and the yellow squares are the working atom types derived from the initial types; the same logic applies for the other colors. For hydrogen, we see SMARTS patterns that match the parm99/parm@Frosst reference types HC (yellow), H1 (light green), H2 (lilac), and H3 (light red). The bottom (large gray box) shows all final working types and their respective reference atom types as well as partial score for that pairing. ...... 37

2.11 Comparison of SMARTYelem (hydrogen and oxygen) versus SMARTYorig (all atoms) on AlkEthOH. Here we determine the first iteration at which each sampler achieves a 100% total score (numerical values, out of 10,000 in total), then average this number for all runs which were successful in achieving a score of 100% at some point (out of 10 trials in total). The y axis shows which sampler was used (SMARTYorig for all atoms versus SMARTYelem for hydrogen and oxygen) and the x axis shows the temperature employed. In general lighter colors indicate higher success rates, and lower numbers indicate success was achieved more rapidly. Results for SMARTYelem for carbon are not shown since AlkEthOH has only one carbon atom type (CT). ...... 39

2.12 Frequency of success for SMARTYelem on the MiniDrugBank set. For SMARTYelem for MiniDrugBank, we plot the fraction of simulation time where the partial score is 100% for a given parm99/parm@Frosst atom type. The elements shown here are (a) hydrogen, (b) carbon, (c) nitrogen, and (d) oxygen. We used 10 simulations of 10,000 iterations each to construct the plots at different temperatures (y axis). At high temperatures, most of the reference atom types reach 100% partial score. However, at low temperatures, there are more well populated reference atom types but also more which never achieve 100% at any iteration. ...... 40

2.13 SMARTY scoring with the bipartite graph. We illustrate how SMARTY scores working atom types (red on the left) compared to reference atom types (blue on the right) using a bipartite graph. (a) When there is only one generic SMARTS string (“[#6]”), this pattern is assigned to both CT and CA atom types through the PATTY-adapted scheme, but it will be matched only to the more populated reference atom type (CT highlighted in yellow); (b) when SMARTY creates a new set of working types, containing both “[#6]” and “[#6X4]”, the generic atom type will not match CT anymore, since it chemically matches the new working atom type (“[#6X4]”) instead. ...... 42

2.14 Total score versus iteration for SMIRKY sampling torsion fragment types in AlkEthOH. At a temperature of zero (pink line), SMIRKY works solely as a stochastic optimizer and gets stuck in a local optimum. When temperatures are too high, such as 0.1 (gray line), the probability of accepting moves that cause a decrease in score is very high, leading to dramatic changes in score throughout the simulation. At more moderate temperatures, such as 0.001 (blue line), a total score of 100% is eventually achieved. ...... 46

2.15 Frequency of success for SMIRKY simulations with AlkEthOH and PhEthOH. For torsion fragments in both AlkEthOH (a) and PhEthOH (b), we plot the fraction of simulation time where each reference fragment type has a torsion SMIRKS pattern that receives a partial score of 100%. For AlkEthOH, at temperatures below 0.001, all of the reference fragments spend at least 50% of the simulation time with a 100% partial score. However, for PhEthOH, many of the reference fragments spend a significant amount of simulation time without a 100% partial score. ...... 48

3.1 Thermodynamic path for calculating the ∆G_SEA for our hybrid method. S_vac and S_GB are the solvation energies using GROMACS in vacuum and GB, respectively. S_SEA is the solvation energy from SEA using the 5000 trajectories from the implicit solvent hydration free energy using GROMACS. ∆G_SEA was calculated using ∆∆G_GB→SEA and ∆G_hyd (implicit solvent). ...... 63

3.2 Implicit (yellow dots) and hybrid solvent (lilac dots) versus explicit solvent [1] hydration free energy for the FreeSolv database molecules. (a) Linear regression (red line) analysis of implicit solvent model (GB^OBC) values (on the horizontal axis) and explicit solvent model values (on the vertical axis). Molecules with a positive electrostatic energy are shown as black triangles. The free energy distributions for the implicit and explicit solvent methods are shown on the top and the right side of the graph, respectively. The R^2 is equal to 0.8. (b) Linear regression (red line) analysis of hybrid SEA values (on the horizontal axis) and explicit solvent model values (on the vertical axis). The free energy distributions for the hybrid SEA and explicit solvent methods are shown on the top and the right side of the graph, respectively. The R^2 is equal to 0.95. ...... 67

3.3 Hydration free energy distributions for explicit [1], implicit, and hybrid solvent hydration free energy calculations for the FreeSolv database. (a) Explicit (gray) and implicit (yellow) solvent hydration free energy distributions; (b) Explicit (gray) and hybrid SEA (lilac) solvent hydration free energy distributions; (c) Implicit (yellow) and hybrid SEA (lilac) solvent hydration free energy distributions. ...... 69

3.4 Molecules from the FreeSolv database that showed positive GB polarization values. ...... 70

3.5 GB polarization energy for 1,4-dimethylpiperazine. We changed the radius size of the nitrogen (wine line) and hydrogen (tan line), separately, and calculated their GB polarization energies. Errors are shown in silver lines. ...... 72

3.6 Hydration free energies for a series of molecule types. (a) G_el from the GB^OBC in kcal/mol. (b) Explicit solvent hydration (beige), implicit solvent (dark red), SEA (light lilac), and GBNSR6 (sand) hydration free energies for seven molecules. The electrostatic term for 1-methylpyrrolidine is 0.065 ± 0.005. No explicit solvent hydration free energy is shown for 1-methylpyrrolidine as it is not part of the FreeSolv database. ...... 76

4.1 2D structures of cycloheximide and lactimidomycin, respectively. ...... 80

4.2 Cycloheximide (magenta) and Lactimidomycin (purple) structures in the binding pocket of the 60S ribosome from S. cerevisiae [2]. Hydrogen bonds are represented in red and black dashed lines for CHX and LTM, respectively. ...... 81

4.3 2D structures of the ligands tested in this study and their activities. (A) The molecules whose 50% inhibitory concentrations (IC50) were tested against P388 leukemia cells, and (B) the ones whose activity was not tested. ...... 82

4.4 2D structures of the ligands tested in this study with no activities. Structures of the molecules that were not tested in P388 leukemia cells to determine their activity. ...... 82

4.5 Most likely binding mode for (A) chlorolissoclimide, (B) its analogue, (C) haterumaimide J, and (D) its analogue. ...... 87

4.6 Chlorolissoclimide crystal structure in the binding site of the S. cerevisiae 80S ribosome; hydrogen bonds are in black dashed lines. ...... 88

4.7 Most likely binding poses for compounds (A) 37, (B) 38, (C) 39, (D) 41, (E) 40, (F) hatN, (G) 3168, (H) hatQ, (J) 393, (K) 45, (L) dichlorolissoclimide, (M) hatJ, (N) anti1, (O) anti2, and (P) C3AC in the binding pocket of ribosome. Hydrogen bonds are represented in black dashed lines. ...... 89

4.8 Ligands bound to the 80S ribosome. (a) Co-crystal structure of CL, (b) docking-based structure of analogue 39, (c) docking-based structure of analogue 45, (d) docking-based structure of analogue 37. ...... 92

4.9 Compound 45 bound to the 80S ribosome. The two halogen-π interactions on the facing guanines (G2793/4) are in red dashed lines. The hydrogen bonds are in black dashed lines. ...... 93

4.10 Comparison of crystal structure and docking pose prediction. Co-crystal chlorolissoclimide (magenta) and the binding mode predictions of (a) chlorolissoclimide (green) and (b) analogue (orange), with RMSDs of 1.6 Å and 1.2 Å, respectively. The hydrogen bonds between the co-crystal CL and ribosome are shown in black dashed lines. ...... 94

5.1 Crystal structure of CdiA-CT (green) and CdiI (cyan). ...... 98

5.2 Crystal structure of macrocyclic peptide [3]. ...... 99

5.3 Average hydrogen bonds per residue for alanine mutations (mut1 to mut11) to CdiI. The asterisk (*) represents where the alanine mutation is for each mutated peptide. Some of the alanine mutations drastically reduced the average number of hydrogen bonds between the peptide and CdiI protein. ...... 104

5.4 RMSD of the backbone for the (a) ligand and (b) ligand-protein complex for mutations 1 to 11. All mutated peptides showed a constant RMSD and variations below 0.3 nm. The peptide alone shows more flexibility than when it is bound to the protein. ...... 106

5.5 Secondary structures analysis (DSSP). Secondary structures of the (A) original macrocyclic peptide, (B) mut12 and (C) mut2 over time. The original macrocyclic peptide kept its β-sheet structure considerably stable over time. The mut12, which has two cysteine mutations to form a disulfide bridge, lost its β-sheet structure over time and the turn was also compromised. The mut2 is a more conservative mutation, where a serine residue was switched to an alanine; the β-sheet structure was kept over time, but the peptide lost its bent shape. ...... 107

5.6 Average hydrogen bonds per residue for mutations (mut13 to mut20). The asterisk (*) represents where the mutation is for each mutated peptide. Some of the mutations drastically reduced the average number of hydrogen bonds between the peptide and CdiI protein. ...... 108

LIST OF TABLES

Page

2.1 Decorators for elaborating SMIRKS and SMARTS strings. A selection of decorators that can be used for atoms (top section) and bonds (middle section) in SMIRKS or SMARTS strings. Decorators on atoms and bonds can be combined using Boolean operators (bottom section), where the high precedence and is applied before an or operator and the low precedence and is applied after. For a complete list of decorators and documentation for SMARTS and SMIRKS see the Daylight Theory Manual [4]. ...... 10

2.2 Average of average total scores of 10 SMARTY simulations with AlkEthOH and PhEthOH. We took the average score at each temperature for each of our 10 simulations (10,000 iterations each), then averaged this across all 10 simulations. * - uncertainties were estimated by the standard error over all 10 trials. ...... 33

2.3 Initial and maximum total scores for all SMARTY simulations with AlkEthOH, PhEthOH, and MiniDrugBank. Initial scores are the score from the initial types in each molecule set. Maximum scores are the highest total score measured during a single iteration in any simulation performed. SMARTYorig scores are shown in the ‘all’ row, and SMARTYelem scores are shown broken out in the ‘hydrogen’, ‘carbon’, ‘oxygen’, ‘nitrogen’, and ‘sulfur’ rows. ...... 35

2.4 Initial and maximum total scores achieved across all SMIRKY simulations with AlkEthOH, PhEthOH, and MiniDrugBank. Initial scores are the score from the initial types for each fragment type in each molecule set. Maximum scores are the highest total score measured during a single iteration in any simulation performed. ...... 44

2.5 Number of fragment types with 100% partial score in all SMIRKY simulations. This table summarizes results for all SMIRKY simulations. For each molecule set, we report the fraction of reference fragment types that get a 100% partial score during at least one simulation. It shows that we were successful in identifying all fragment types in the AlkEthOH and PhEthOH sets where all fractions are equal to one. As with SMARTY, we were still partially successful for MiniDrugBank given that we find the majority of fragment types. ...... 45

2.6 Average of average total scores of 10 SMIRKY simulations with AlkEthOH. We took the average score at each temperature for each of our 10 simulations (10,000 iterations each), then averaged this across all 10 simulations. * - uncertainties were estimated by the standard error over all 10 trials. ...... 47

2.7 Average of average total scores of 10 SMIRKY simulations with PhEthOH. We took the average score at each temperature for each of our 10 simulations (10,000 iterations each), then averaged this across all 10 simulations. * - uncertainties were estimated by the standard error over all 10 trials. ...... 49

2.8 Average of average total scores of 10 SMIRKY simulations with MiniDrugBank. We took the average score at each temperature for each of our 10 simulations (10,000 iterations each), then averaged this across all 10 simulations. * - uncertainties were estimated by the standard error over all 10 trials. ...... 51

3.1 Electrostatic (G_el) and hydration free energies (∆G_solv) in kcal/mol for 1,4-dimethylpiperazine using AMBER in different solvation models and radii. ...... 73

3.2 Comparison of results with experimental hydration free energies using the full free energy scheme (∆G) and several different implicit solvation models (kcal/mol). ...... 74

4.1 Shape overlay results using ROCS Software ...... 90

5.1 List of mutations created from the original macrocyclic peptide. The amino acids mutated from the original peptide are in bold letters. ...... 103

5.2 RMSF of the backbone of mut1 to mut11 using GROMACS. ...... 105

List of Algorithms

Page

1 Macrocyclic peptide auto run ...... 129
2 Macrocyclic peptide analysis ...... 130

ACKNOWLEDGMENTS

Over the course of my graduate career, I have learned a wealth of information from David Mobley. I would like to extend my deepest gratitude to my advisor for giving me the opportunity to be part of his lab and for teaching me. I greatly admire his patience and hard work. I also would like to thank the entire Mobley group for their friendship, advice, and help over the years. The lab has such a positive atmosphere that I could not ask for better peers.

During my graduate journey, I was blessed with the opportunity to work with incredible scientists on different projects that provided me with an exceptional experience. Thank you to Chris Vanderwal and Zef Könst for giving me the opportunity to learn and collaborate with you. I also would like to thank the brilliant scientists behind the Open Force Field consortium for their invaluable scientific insights, especially Chris Bayly, Michael Gilson, Michael Shirts, and John Chodera. And a special thank you to Chris Fennell for his help with SEA.

I had the privilege to meet incredible professors at UCI who shaped my scientific development. Andrej Luptak was a great mentor; his breadth of knowledge and attention to his students is something I aspire to emulate. Celia Goulding is synonymous with girl power. I admire her incredible scientific knowledge, her positive energy, and how she finds time to surf (my husband hopes I will join you someday). Luckily, I could have Andrej and Celia as committee members; thank you for everything.

My transition from Brazil to the US was not easy, and the friends that I made here in Irvine made everything much lighter. Thanks to Bev, Tom, Riann, Bao, Mona, and Rachel. I could not forget to thank my Brazilian friends, Amanda and Adriana, who have supported me since the beginning and helped me during the difficult times. Also, thank you Kirsten and Alex for your great friendship over the years (and for giving me a “forced break” from grad school to go to Hawaii for your wedding).

I would like to thank my family for their unconditional love. My family has helped me throughout the process, especially my mother and father, who have been a never-ending source of support and guidance. In 2008, I met the love of my life and partner in crime, Guilherme, who was responsible for awakening my interest in science. His love, guidance, and support helped me get through these five long years of grad school. This dissertation would not be done without you. With Gui, I got an incredible second family, which makes me feel hugely blessed due to all the support and caring from his parents and brother.

Colin, my love, you are so little and have taught me so much. Thank you for the opportunity of being your mother. Te amo! And Summer, you went through all this process with me, I could not forget to thank you!

My dissertation was supported by the Brazilian Science without Borders program (Capes) and LASPAU, the National Science Foundation, the National Institutes of Health, and the Department of Pharmaceutical Sciences at the University of California.

CURRICULUM VITAE

Camila Zanette

PHARMACEUTICAL SCIENCES · COMPUTATIONAL CHEMISTRY

[email protected] | www.camilazanette.com | camizanette | czanette

Education

Ph.D., University of California Irvine (UCI), Irvine, CA, USA
PHARMACOLOGICAL SCIENCE - ADVISOR: DAVID L. MOBLEY
Sep 2013 - PRESENT
• Computational chemistry

M.S., University of California Irvine (UCI), Irvine, CA, USA
PHARMACOLOGICAL SCIENCE - ADVISOR: DAVID L. MOBLEY
Sep 2013 - Dec 2016
• Computational chemistry

M.S., University of Sao Paulo (USP), Sao Paulo, SP, Brazil
NUCLEAR TECHNOLOGY - APPLICATION (RADIOPHARMACEUTICALS) - ADVISOR: ELAINE BORTOLETI DE ARAUJO
Feb 2011 - May 2013
• Study of the production process of radiopharmaceutical FLT-18F in automatic system: process validation and analytical methodology

B.S., State University of Maringa (UEM), Maringa, PR, Brazil
PHARMACY (SPECIALIZATION IN PHARMACEUTICAL INDUSTRY) - ADVISORS: JOSIANE M. DE MELLO AND REGINA A. C. GONÇALVES
Mar 2006 - Jan 2011

Skills

Computational Chemistry: Molecular Modeling, Molecular Dynamics, Docking, Virtual Screening, Free Energy Calculations, Chemoinformatics
Bioinformatics: Genetics, Sequence Alignment (Blast), Selection Analysis, Motif Searching, Protein-protein Interaction
Programming: Python, LaTeX, C/C++, bash, R, Perl, Machine Learning
Analytical Methods: HPLC, GC, Chromatography
Languages: Portuguese (native), English (fluent), Spanish (basic)

Experience

Agendia, Irvine, CA
BIOINFORMATICIAN
Feb 2018 - present
• Data analysis in molecular diagnostics
• Data handling and transfer to clinical trials
• QC monitoring
• Software update and validation

University of California Irvine, Irvine, CA
GRADUATE STUDENT RESEARCHER - MOBLEY LAB [MOBLEYLAB.ORG]
Sep 2013 - present
• Performed solvation free energy calculations of 642 small molecules, with experience in 4 different software packages (GROMACS, OpenEye toolkit, OpenMM, Amber).
• Used bioinformatics tools and structure-based search methods to find GTP aptamers in BLAST database (Luptak Lab.)
• Worked on parallel molecular dynamics simulations of protein and macrocyclic peptide engineering using GROMACS.
• Applied 3 different computational docking programs (AutoDock Vina, OEDocking, and Schrödinger) and a shape similarity technique to study structure-activity relationships of lissoclimide compounds (translation inhibitors) in ribosome in collaboration with Vanderwal Lab.
• Worked collaboratively on code development of a chemical perception tool in the innovative Open Force Field Group [openforcefield.org] to sample atom types and fragment types using Monte Carlo optimization and statistical Bayesian Inference.

Sequencial Technical School, Sao Paulo, Brazil
PROFESSOR
Jul 2011 - Dec 2011
• Taught introductory concepts in pharmaceutical formulation development in the Pharmacy Technician Program.
• Prepared and proctored quizzes and exams.
• Taught laboratory practices in pharmaceutical formulation.


ABSTRACT OF THE DISSERTATION

Development and Application of Computational Approaches in Drug Discovery

By

Camila Zanette

Doctor of Philosophy in Pharmacological Science

University of California, Irvine, 2019

Professor David L. Mobley, Chair

The early stages of drug discovery are long and costly. One way to reduce the time, resources, and financial investment spent is to apply computational tools. With the tremendous progress and achievements in computational chemistry, the application of computational tools in drug lead discovery and design has increased. Today, computational chemistry is considered a highly valuable and well-established tool in drug discovery.

A variety of computational chemistry methods are used for guiding molecular design and finding new potential drugs and targets. Examples include molecular dynamics simulations, free energy calculations, virtual screening, and structure-activity relationship analysis. Although computational chemistry techniques are widely used in industrial and academic environments, there is still room for improvement.

Here, I present several studies in which I developed and applied computational chemistry tools to drug discovery problems. The first study is the development of a tool, as part of the Open Force Field consortium, to learn the chemical perception of force field typing rules. Secondly, I describe my work on using a new hybrid method to calculate free energies of small molecules. Thirdly, I present a binding mode prediction study to help the understanding of structure-activity relationships in lissoclimides. Lastly, I present a study applying molecular dynamics simulations to guide the redesign of a macrocyclic peptide.

Chapter 1

Introduction

Over 50 years ago, scientists developed the first tools in computational chemistry. In 1965, Allinger et al. [5] presented the first computational tool for molecular mechanics (MM) methods. Three years later, the consistent force field (CFF) program was developed by Lifson and Warshel [6]. In 1971, the Protein Data Bank, a database widely used for three-dimensional structural data of large biological molecules, was established. From that point on, computational chemistry has grown steadily, and it is now a well-established tool in drug discovery.

The use of computational chemistry tools at the early stages of the drug discovery process is extremely helpful. In these early stages, researchers search for molecules which bind to a biomolecular target with sufficient affinity and suitable properties to become a pharmaceutical drug [7, 8, 9, 10]. The process of finding new drugs often requires an exhaustive search through gigantic libraries of molecules and consumes a good amount of resources. Therefore, computational chemistry plays an important role in this process by reducing the resources spent during the early stage. For example, a computational chemist can screen thousands of compounds against a specific target using virtual screening in order to find potential drug candidates, and select only the best hits to be tested further.

In this work, I present my main results where I develop and apply different computational chemistry methods to solve problems in the area. I had a chance to work in different aspects of the computational chemistry area, using docking, molecular dynamics simulations, shape overlay, performing free energy calculations, and developing new tools.

Chapter 2 is adapted from Zanette, C., Bannan, C. C., Bayly, C. I., Fass, J., Gilson, M. K., Shirts, M. R., ... & Mobley, D. L. Journal of chemical theory and computation, 15(1), 402-423. (2018), where I describe our work on developing a novel approach to sample over atom or fragment types typically used in biomolecular force fields, so that these definitions can ultimately be learned automatically without a human expert. The tools developed in this work are freely available to the community on GitHub along with Open Force Field Initiative software (https://github.com/openforcefield). These algorithms set the stage for future applications in which the scoring function will be based on the agreement of simulations with experimental, quantum, or other reference data.

In Chapter 3 (in preparation), I present a study on solvation free energy calculations, in which we developed a new hybrid methodology to calculate the hydration free energy of small molecules. Along with the free energy calculations, we found interesting test cases for future implicit solvent methods.

Chapter 4 is adapted from Könst, Z. A., Szklarski, A. R., Pellegrino, S., Michalak, S. E., Meyer, M., Zanette, C., et al. Nature chemistry, 9(11), 1140 (2017) and Pellegrino, S., Meyer, M., Könst, Z. A., Holm, M., Voora, V. K., Kashinskaya, D., Zanette, C. et al. Nucleic acids research, 1 (2019). In this study, we collaborated with different laboratories to study lissoclimide analogues and their structure-activity relationships. Our findings from in silico docking and in cerebro analysis helped to guide the study and clarified interesting aspects of how these compounds bind to the ribosome.

Chapter 5 describes progress made toward the redesign of a macrocyclic peptide to interrupt bacterial contact-dependent growth inhibition.

Chapter 2

Toward learned chemical perception of force field typing rules

2.1 Abstract

Molecular mechanics force fields define how the energy and forces in a molecular system are computed from its atomic positions, thus enabling the study of such systems through computational methods like molecular dynamics and Monte Carlo simulations. Despite progress toward automated force field parameterization, considerable human expertise is required to develop or extend force fields. In particular, human input has long been required to define atom types, which encode chemically unique environments that determine which parameters will be assigned. However, relying on humans to establish atom types is suboptimal. Human-created atom types are often developed without statistical justification, leading to over- or under-fitting of data. Human-created types are also difficult to extend in a systematic and consistent manner when new chemistries must be modeled or new data becomes available. Finally, human effort is not scalable when force fields must be generated for new (bio)polymers, compound classes, or materials. To remedy these deficiencies, our long-term goal is to replace human specification of atom types with an automated approach, based on rigorous statistics and driven by experimental and/or quantum chemical reference data. In this work, we describe novel methods that automate the discovery of appropriate chemical perception: SMARTY allows for the creation of atom types, while SMIRKY goes further by automating the creation of fragment (nonbonded, bonds, angles, and torsions) types. These approaches enable the creation of move sets in atom or fragment type space, which are used within a Monte Carlo optimization approach. We demonstrate the power of these new methods by automating the rediscovery of human defined atom types (SMARTY) or fragment types (SMIRKY) in existing small molecule force fields. We assess these approaches using several molecular datasets, including one which covers a diverse subset of DrugBank.

2.2 Introduction

Molecular simulations can provide detailed views of chemical and biological events that involve conformational changes and noncovalent binding, such as allostery, protein folding, and ligand-protein binding [11, 12, 13]. They also can be used to estimate various experimental observables, including solvation free energies of small molecules [14, 15, 16], binding free energies of small molecules to proteins and molecular hosts [17, 18, 19, 20], and the on and off rate constants for noncovalent association events [21, 22]. Molecular simulation technologies rely on potential functions, or force fields, mathematical functions which estimate the energy of the molecular system and the forces on its atoms as a function of the atomic coordinates. The force field used in a simulation is a critical determinant of the accuracy of the results. The centrality of the force field has motivated decades of pioneering and innovative research and development [23, 24, 11, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37]. In spite of these efforts, recent studies indicate that force field issues still significantly limit the accuracy of simulations [38, 39, 40, 41, 42, 43, 44, 19, 45, 46, 47, 48, 49, 50, 51].

The Open Force Field Initiative seeks a systematic approach to the continuing challenge of improving force fields, reducing the human effort required and generating new force fields that make statistically sound use of appropriate reference data [52]. One major goal of our effort is to automate the development of new force fields given a choice of functional form and reference data. Later, we aim also to automate decisions about what functional forms to use and what reference data to fit. These capabilities will dramatically reduce the human time required to create new force fields, while also producing more accurate force fields with clear dependencies on the underlying data. Such capabilities would also support advances in force field science, such as determining what functional form achieves the best accuracy for particular applications at a specified level of computational cost. For example, a force field that explicitly treats electronic polarizability and includes fixed electric multipoles, such as AMOEBA [53, 54, 55, 56, 35, 57, 58, 59], should be able to reach higher accuracy than a force field with fixed, atom-centered point partial charges. But how much additional accuracy do multipoles and polarizability actually provide? We cannot currently answer this question because the effect of the functional form is conflated with other issues, such as the use of different input data, different atom types, different fitting methods, and differences in the chemical intuition brought to the problem. In contrast, an automated approach would allow systematic evaluation of the benefits of an advanced functional form within a given context.

Although some force field parameterization tools already exist [60, 61, 62, 63, 64, 65, 66, 67], the full process of force field development has not been automated. For example, the parameterization software ForceBalance [68, 69, 70] advanced the field by automating the adjustment of force field parameters against experimental and theoretical observables with a gradient-based optimizer. However, the researcher must still address not only how to weight the components of the objective function [34, 71, 72, 73] but also the fundamental question of what atom types to use.

As we seek to automate parameterization, it is important to take note of where human expertise is typically employed, and one key place is in determining which chemical environments (usually treated as atom types) will be treated separately by a force field. We call this process of distinguishing between different chemical environments in order to assign force field parameters chemical perception. Chemical perception has been a key ingredient in building general purpose small molecule force fields [74, 75, 76, 45, 77, 78, 79, 80]. Chemical perception in current force fields largely consists of assigning atom types or fragment (nonbonded, bond, angle, torsion) types, defined using human intuition. Ideally, force field fitting should involve adjusting not only the numerical parameters of a force field, but also the chemical perception. For example, we could start a force field parameterization process with only a single type of carbon-carbon bond, but the automated process might propose adding a second bond type, thus allowing distinct parameters to be assigned to single versus double bonds. If this proposal led to improved accuracy sufficient to justify the increased number of parameters, it would become part of the force field definition. Thus, we need a way to sample not only over the numerical parameters of a force field, but also over its atom or fragment types. However, we are not aware of any existing algorithm or software tools to carry out this type of automated parameter sampling.

In this paper, we address this fundamental problem with a novel approach to sampling over atom or fragment types typically used in biomolecular force fields, so that these definitions can ultimately be learned automatically without a human expert. In particular, we introduce methods of sampling over hierarchical chemical perception trees. That is, we organize force field atom or fragment types into hierarchical trees, such that child types contain more specific types than parent types. We utilize the SMARTS and SMIRKS substructure definition languages [4, 81] to specify atom types and more general fragment types which may be associated with distinct numerical parameters. Then, we show how a Monte Carlo scheme can be used to sample over chemical perception trees defined using these languages. Both tools for sampling chemical perception trees, SMARTY (atom types) and SMIRKY (fragment types), are freely available to the community on GitHub along with Open Force Field Initiative software (https://github.com/openforcefield). As a proof of principle for this approach, we compare the chemical complexity of sampled atom and fragment types to those in existing force fields. These algorithms set the stage for future applications in which the scoring function will be based on the agreement of simulations with experimental, quantum, or other reference data.

2.3 Methods

We consider two different approaches for assigning force field parameters to a molecule: direct and indirect chemical perception [82]. Indirect chemical perception is exemplified by the traditional approach to parameter assignment (such as the AMBER and CHARMM force field families) in which, once a molecule’s atoms have been typed, all other information about chemical environments (including bond orders) is discarded [83]. All force field parameters, including valence terms, are then assigned using only the atom types and the way they are connected [84]. Thus, in indirect chemical perception, atom types encode the information needed to assign all valence and bonded parameters [85, 86, 79, 87, 88]. Alternatively, in direct chemical perception, parameters are assigned based on the full molecular graph, that is, a full valence representation of the molecule including elements, connectivity and bond orders, rather than just atom types and local connectivity. This approach thus involves direct analysis of the full molecular graph, including bond orders, instead of indirect analysis through connected atom types, as in indirect chemical perception. As previously detailed, direct chemical perception allows force fields to be fully specified with far fewer numerical parameters than required with indirect chemical perception [82].

We recently introduced a new force field format, the SMIRKS Native Open Force Field (SMIRNOFF) [82], which uses SMIRKS patterns to allow direct assignment of force field parameters, thereby implementing direct chemical perception. This is a substantial break from indirect chemical perception, which uses a graph labeled with atom types to assign parameters. SMIRNOFF instead uses substructure searches via different SMIRKS strings to assign parameters when the target molecular substructures are encountered. Thus, the chemical perception and parameters for each force term are separate from those applied to other force terms. For example, a new set of Lennard-Jones parameters could be introduced without needing to introduce additional valence terms, or vice versa.

In this paper, we introduce approaches to sample chemical perception trees with both traditional atom types and SMIRNOFF fragment types. Here, we use the term chemical perception tree to describe a hierarchical classification of molecular substructures in order to assign parameters. One of our major interests is to sample over a variety of chemical perception trees to see if we can match the chemical perception used in existing force fields. We further use the term fragment type to refer to the more generalized notion of an “atom type” used by direct chemical perception: a particular substructure that would be assigned a particular parameter. The SMIRNOFF format, then, has four categories of fragment types (bond, angle, torsion, and nonbonded types), corresponding to the valence and nonbonded interaction force field parameters.

In order to automate sampling of chemical perception trees, a language to express atom or fragment types directly was required. We utilize the SMILES Arbitrary Target Specification (SMARTS) and SMILES Reaction Specification (SMIRKS) languages for this purpose [4]. A SMARTS string is a chemical substructure query, where a substructure is a set of atoms connected by bonds, and both atoms and bonds are typically further characterized with “decorators” (Table 2.1). For example, a bond may be decorated with “@” to indicate that it is in a ring, and/or “-” to indicate a single bond, and detailed specifications may be constructed by the use of Boolean operators. Atoms are set apart from bonds with square brackets; for example, “[#6X4]!@[#6r5]” describes a tetrahedral carbon atom (“[#6X4]”) connected by a non-ring bond (“!@”) to another carbon atom which is in a five-membered ring (“[#6r5]”). SMIRKS strings provide a language similar to SMARTS but which also includes atom indexing. SMIRKS were created to allow description of reactions, but here we use only the atom indexing feature. For our purposes, we take advantage of the indexing in SMIRKS to track relevant atoms involved in fragment types such as bond, angle, or torsion types. For example, the SMARTS above could become a SMIRKS string with the addition of “:” to identify the atom indices (“[#6X4:1]!@[#6r5:2]”). Following the SMIRNOFF notation, a bond parameter involves two indexed atoms, an angle parameter three, and so on [82].

Category            Symbol   Definition
Atom                #n       atomic number
Atom                *        any atom
Atom                A        aliphatic
Atom                a        aromatic
Atom                Hn       hydrogen count
Atom                Xn       connectivity
Atom                ±n       charge
Bond                ~        any bond
Bond                @        ring bond
Bond                -        single bond
Boolean modifiers   ,        logical or
Boolean modifiers   &        high precedence logical and
Boolean modifiers   ;        low precedence logical and
Boolean modifiers   !        logical not

Table 2.1: Decorators for elaborating SMIRKS and SMARTS strings. A selection of decorators that can be used for atoms (top section) and bonds (middle section) in SMIRKS or SMARTS strings. Decorators on atoms and bonds can be combined using Boolean operators (bottom section), where the high precedence and is applied before an or operator and the low precedence and is applied after. For a complete list of decorators and documentation for SMARTS and SMIRKS see the Daylight Theory Manual [4].

The atom and fragment type definitions used here are based on the chemical environment of each atom extending up to two bonds away. In both SMARTY and SMIRKY, we only consider the chemical environment around the primary atom and atoms one bond away (alpha) or two bonds away (beta). Moves propose changes at any of these positions. The primary atom is the atom being typed by a SMARTS or all of the indexed atoms in a SMIRKS. Atoms beyond this beta position are not currently considered. For example, consider a SMARTS describing the hydroxyl oxygen in an alcohol (Figure 2.1). Here, the oxygen is the atom to be typed (yellow), the hydrogen and carbon bonded to the oxygen are the alpha atoms (blue), and the carbon two bonds away is a beta atom (pink), as are the ring oxygen and the hydrogen atom bonded to the alpha carbon (not included in the SMARTS pattern). Including just atoms up to two bonds away leads to a wide variety of possible SMARTS or SMIRKS patterns. The number of possible SMARTS patterns that can be generated using environments including atoms only out to the beta position is on the scale of 10^16, estimated using the number of possible atoms in one pattern and the number of SMARTS decorators available. Atom types in most present-day force fields usually depend only on atoms out to the beta position, although future force fields might require going further. Therefore, SMARTY and SMIRKY currently do not propose new atom or fragment types that involve atoms past the beta position.

[Figure 2.1 image: the SMARTS pattern [#8X2$(*-[#1])$(*-[#6]~[#6])] annotated with the labels Base Type, Decorator, Bonds, Alpha, and Beta.]

Figure 2.1: Describing a specific chemical group with SMARTS. An example of how a particular type of hydroxyl oxygen in a molecule (top) can be described using SMARTS strings. In this case, the SMARTS pattern (bottom) is for a hydroxyl connected to a carbon which is itself connected to another carbon. The alpha atoms (light blue) for the oxygen base type (“[#8]”) are one hydrogen (“[#1]”) and one carbon (“[#6]”), and they are connected to the base type via a single bond (“-”). The carbon (“[#6]”) atom is connected to another carbon (“[#6]”, which is considered a beta atom (pink)) via any bond (“∼”). The oxygen has a decorator (light green) (“X2”) which means it has two connected atoms.

Like existing atom typing schemes [89, 81], both SMARTY and SMIRKY take a hierarchical approach to defining atom and fragment types. That is, the SMARTS and SMIRKS strings specifying types are listed in a specific order, and the last string that matches an atom or fragment is the one assigned (“last one wins”). For example, in Figure 2.2, HC (“[#1$(*-[#6])]”) would match all hydrogens bound to carbons, but then the string H1 (“[#1$(*-[#6]~[#8])]”) overrides HC on the hydrogens with one oxygen in the beta position. In the SMARTS language, “$” is used to specify neighboring atoms; in “[#1$(*-[#6]~[#8])]”, the hydrogen is connected to a carbon bonded to an oxygen. This hierarchy allows a general pattern to catch all hydrogens that are not described by more specific patterns later in the list. We call a complete, hierarchical type specification of this sort a chemical perception tree. The same approach can be used on a hierarchical list of SMIRKS patterns for fragment types. The PATTY algorithm [90] was developed to automatically assign traditional atom types from a hierarchy, where each atom was assigned the last type matched to it. We adapted PATTY for use in SMARTY and SMIRKY. Specifically, the set of SMARTS or SMIRKS is matched to a molecule using OpenEye’s OEChem Toolkit [91, 92, 93] and then only the last pattern to match a given atom or fragment is stored.
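The “last one wins” rule can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the SMARTY implementation: each atom is a hypothetical dictionary recording its element, the element of its alpha neighbor, and the elements in the beta position, and plain predicates stand in for the SMARTS matching that SMARTY delegates to a cheminformatics toolkit. The type name `generic_H` is also made up for the example.

```python
# Ordered hierarchy: general patterns first, more specific patterns later.
# Each predicate is a stand-in for the SMARTS shown in its comment.
HIERARCHY = [
    ("generic_H", lambda a: a["element"] == "H"),                        # [#1]
    ("HC", lambda a: a["element"] == "H" and a["alpha"] == "C"),         # [#1$(*-[#6])]
    ("H1", lambda a: a["element"] == "H" and a["alpha"] == "C"
        and "O" in a["beta"]),                                           # [#1$(*-[#6]~[#8])]
]


def assign_type(atom, hierarchy):
    """Return the name of the last entry in the hierarchy that matches the atom."""
    assigned = None
    for name, matches in hierarchy:
        if matches(atom):
            assigned = name  # a later match overrides any earlier one
    return assigned


# Hypothetical atoms: a hydrogen in an alkane, a hydrogen on a carbon that
# carries one oxygen, and a hydroxyl hydrogen.
alkane_h = {"element": "H", "alpha": "C", "beta": ["C", "H", "H"]}
ether_h = {"element": "H", "alpha": "C", "beta": ["O", "H", "H"]}
hydroxyl_h = {"element": "H", "alpha": "O", "beta": ["C"]}

for atom in (alkane_h, ether_h, hydroxyl_h):
    print(assign_type(atom, HIERARCHY))
```

Because the general pattern comes first, any hydrogen that no later pattern describes keeps the general type, which is the catch-all behavior described above.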

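[Figure 2.2 image: molecules from the AlkEthOH set with each hydrogen labeled by its reference atom type (HO, HC, H1, H2, or H3), shown with the SMARTS patterns [#1], [#1$(*-[#6])], [#1$(*-[#6]~[#8])], [#1$(*-[#6](-[#8])~[#8])], and [#1$(*-[#6](-[#8])(-[#8])~[#8])].]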

Figure 2.2: Illustration of the parm99/parm@Frosst hydrogen atom types for the AlkEthOH set. This specific set has five different hydrogen types: HO (highlighted in blue), HC (highlighted in yellow), H1 (highlighted in green), H2 (highlighted in lilac), and H3 (highlighted in pink). Reference atom type labels for each hydrogen atom are shown in italics next to each atom. SMARTS recovered by SMARTY for these atom types are shown in the corresponding color.

In the subsections below, we describe methods of varying the SMARTS and SMIRKS strings used to define atom or fragment types, respectively, in order to sample over chemical perception trees. We first provide a general description for our Monte Carlo sampling procedure including how we sample over chemical perception trees with a scoring function based on the agreement of the sampled atom or fragment types with those of an existing force field [94, 95, 96, 97]. Next, we detail two software packages for testing this procedure. The first is SMARTY, which learns chemical perception trees in the setting of traditional indirect chemical perception, such as the atom types found in an AMBER-family force field (Section 2.3.2). The second is SMIRKY, which learns chemical perception trees using direct chemical perception, such as the fragment types in a SMIRNOFF-format force field. Our description of SMIRKY focuses on those aspects which differ from SMARTY (Section 2.3.3). Both approaches are diagrammed in Figure 2.3. We then describe how we evaluate the ability of these methods to sample chemical perception trees of the chemical complexity found in the reference force fields. This analysis mirrors our ultimate goal of using a scoring function to measure the ability of a chemical perception tree, combined with suitable numerical parameters, to replicate a set of reference experimental and/or quantum chemical data. Finally, we describe the molecule sets used and details of simulations run to test both SMARTY and SMIRKY (Section 2.3.4).

2.3.1 Monte Carlo sampling over chemical perception trees

We use the Metropolis Monte Carlo (MC) algorithm [98, 99] to sample over chemical perception trees by making changes to a set of atom or fragment types. We call this set of atom or fragment types being changed the “working types”; to be more specific, we say “working atom types” or “working fragment types”. A single iteration of the algorithm comprises

[Figure 2.3 image: workflow diagrams for SMARTY (left) and SMIRKY (right). Inputs: molecules, input files (plus decorator odds for SMIRKY), and reference data. Process, repeated over iterations: choose an atom type (SMARTY) or parameter type (SMIRKY), change its SMARTS (equal probabilities) or SMIRKS (weighted probabilities), then accept or reject the proposal. Output: result.]

Figure 2.3: SMARTY and SMIRKY. Workflows for the SMARTY (left) and SMIRKY (right) tools are shown. There are three input data categories for SMARTY and SMIRKY (shown at top). In SMARTY (on the left), there are input files (such as base types, initial atom types, and decorators), molecules (input via a molecule set file), and reference data consisting of typed molecules (parm99/parm@Frosst atom typing was used in this work). SMIRKY has similar inputs, shown in the purple area on the right, and additionally allows the user to set the decorator odds; its reference data are fragment types (here, from smirnoff99Frosst). The algorithms are represented in the green area and are available on GitHub (https://github.com/openforcefield/smarty). Both tools begin by reading and processing the input files (top part of the green area). SMARTY (on the left) then conducts a series of moves (coral area) consisting of choosing one working atom type per iteration (icon with connected atoms), and deleting or modifying (pencil icon) this atom type, with all choices made with equal probabilities (Section 2.3.2). SMIRKY (on the right) is similar, but samples over working fragment types (such as nonbonded types, bonds, angles, and proper and improper torsions; in the figure these are represented by two different icons with connected atoms) and uses weighted probabilities for its choices (Section 2.3.3). The acceptance criteria (diamond icon) for both tools are the same, using Equation 2.3 (Section 2.3.1). Both tools run a user-specified number of iterations and iterate over the steps in the coral area (repeat icon), then write out a final results file after all iterations are completed.

1. choosing an atom or fragment type at random from the working set

2. proposing a move in which the type is either deleted (if it is not in the base set) or used as the starting point to create a new, more specific type

3. computing the change in a scoring function due to the proposed move

4. using the Metropolis criterion [98] to either accept or reject the proposed move, where the scoring function plays the role of the energy.

The effective temperature used in the Metropolis accept/reject decision and the desired number of iterations are user-specified inputs. Note that, if the user-defined temperature is zero, SMARTY and SMIRKY act strictly as optimizers, only accepting moves which result in a higher total score, whereas at nonzero temperatures, it is possible for a move with a decreased score to be accepted.
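The four steps of an iteration, together with the zero-temperature behavior, can be sketched as a short Python loop. This is a simplified illustration under stated assumptions rather than the SMARTY or SMIRKY code: all function names are hypothetical, the “create a new, more specific type” move is replaced by drawing an unused string from a fixed candidate pool, and the score is a made-up toy that rewards recovering two target types.

```python
import math
import random


def metropolis_accept(s_proposed, s_previous, temperature, rng):
    """Metropolis decision in which a higher score plays the role of a lower energy."""
    if temperature == 0:
        return s_proposed > s_previous  # strict optimizer: improvements only
    delta = (s_proposed - s_previous) / temperature
    return delta >= 0 or rng.random() < math.exp(delta)


def sample_types(base_types, candidate_types, score, n_iterations, temperature, seed=0):
    """Toy sampler over a list of working types (stand-in for a chemical perception tree)."""
    rng = random.Random(seed)
    working = list(base_types)
    current = score(working)
    for _ in range(n_iterations):
        chosen = rng.choice(working)                       # step 1: pick a working type
        proposal = list(working)
        if chosen not in base_types and rng.random() < 0.5:
            proposal.remove(chosen)                        # step 2: delete a non-base type
        else:
            unused = [t for t in candidate_types if t not in working]
            if not unused:
                continue
            proposal.append(rng.choice(unused))            # step 2: add a type (simplified)
        proposed = score(proposal)                         # step 3: score the proposal
        if metropolis_accept(proposed, current, temperature, rng):  # step 4
            working, current = proposal, proposed
    return working, current


TARGET = {"HC", "H1"}  # made-up target for the toy score


def toy_score(types):
    """Fraction of the target types present in the working set."""
    return len(TARGET.intersection(types)) / len(TARGET)


types, final_score = sample_types(["[#1]"], ["HC", "H1", "HX"], toy_score, 200, 0.0)
print(types, final_score)
```

At zero temperature the loop never accepts a move that leaves the score unchanged or lowers it, so the unhelpful candidate is never added and no useful type is ever deleted; a nonzero temperature relaxes both restrictions.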

The scoring function measures agreement of a chemical perception tree with that of an existing force field

For this proof of principle study, we developed a scoring function that quantifies the agreement of a chemical perception tree proposed in the course of MC sampling with the chemical perception assignments associated with an existing, operational force field. This scoring function compares how the working types categorize atoms and fragments with how the reference types from the existing force field categorize them. Our basic problem here is to determine which of our working types best corresponds to which reference type when applied to the same set of molecules. Here, in explaining the scoring function, we focus on atom types, but the same approach is also used for other fragment types arising from direct chemical perception, as considered at the end of this subsection.

We use a bipartite graph with a maximum weight matching [100, 101] to score the working

[Figure 2.4 image: four panels, (a) through (d), each showing working atom types on the left (hydrogen [#1], carbon [#6], oxygen [#8]) and reference atom types on the right (OH, H2, H3, H1, HO, HC, OS, CT). Panels (c) and (d) annotate the edges with integer weights.]

Figure 2.4: Illustration of the bipartite graph used to calculate the score for a new proposed move. (a) A bipartite graph is created with a node for each working (light red) and reference (light blue) atom type. (b) Edges (represented by lines) are created for all possible connections between the nodes of the two sets. Initially each edge has a weight of zero. (c) For each atom, the weight is incremented by one on the edge connecting that atom’s working and reference atom types. In this figure, we illustrate the case for potential graph matches to hydrogen working atom type (“#1”) using the AlkEthOH molecule set (Section 2.3.4). However, in practice this process is applied to all working atom types simultaneously. (d) Each node is then restricted to have only a single edge, such that the total weight is maximized.

types (Figure 2.4). A bipartite graph is a graph with vertices divided into two disjoint sets, X and Y, where each edge only connects a vertex in X with a vertex in Y; that is, there are no X-X or Y-Y edges. Here, set X comprises the working atom types and set Y comprises the reference atom types. To compose a graph for scoring, we first process a set of molecules (Section 2.3.4), assigning every atom a working atom type and reference atom type. The set of working atom types in the molecules become vertices in set X and the set of reference atom types in the molecules become vertices in set Y (Figure 2.4(a)). The graph is initialized by construction of an edge of zero weight between each node in X and each node in Y (Figure 2.4(b)). Then, for each atom in each molecule, we identify its working and reference atom types and increment the weight of the edge joining the corresponding set X and set Y vertices by one (Figure 2.4(c)).

We then determine the set of X-Y edges that maximizes the number of atoms assigned the same working types that are also assigned the same types in the reference set. This corresponds to a maximal weight graph-matching problem [102, 103]. A matching set in a graph is a subset of edges that are non-adjacent; i.e., no two edges share a common vertex [103]. Thus, determining the matching set ensures we never have the same working atom type connected by edges to two or more different reference types, and vice versa. For example, in Figure 2.4(c), we would select only one of the edges for the “#1” node. A maximal weight graph matching is one in which the sum of the weights of the retained edges is maximized. Thus, in Figure 2.4(c), we select only the edge of highest weight, here 224. More generally, a maximal weight graph match would include all possible working and reference types, such as the result for the three working types in Figure 2.4(d).
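The graph construction and matching can be sketched in Python as follows. The atom counts are made up for illustration and are not the AlkEthOH values, and the function names are hypothetical. For brevity the sketch finds the maximum weight matching by brute force over all one-to-one pairings, which is only practical for a handful of types; it is a substitute for, not a description of, the polynomial-time matching algorithm a real implementation would use.

```python
from collections import Counter
from itertools import permutations


def edge_weights(typed_atoms):
    """Count atoms for each (working type, reference type) pair; these are the edge weights."""
    return Counter(typed_atoms)


def max_weight_matching(weights):
    """Brute-force search for the one-to-one pairing of types with the largest total weight."""
    working = sorted({w for w, _ in weights})
    reference = sorted({r for _, r in weights})
    if len(working) <= len(reference):
        pairings = (list(zip(working, perm))
                    for perm in permutations(reference, len(working)))
    else:
        pairings = (list(zip(perm, reference))
                    for perm in permutations(working, len(reference)))
    best_pairs, best_total = [], -1
    for pairs in pairings:
        total = sum(weights.get(pair, 0) for pair in pairs)
        if total > best_total:
            best_pairs, best_total = pairs, total
    return best_pairs, best_total


# Made-up example: one (working type, reference type) tuple per atom.
typed_atoms = (
    [("[#1]", "HC")] * 3 + [("[#1]", "H1")] * 2 + [("[#1]", "HO")]
    + [("[#6]", "CT")] * 2 + [("[#8]", "OH")]
)
pairs, total = max_weight_matching(edge_weights(typed_atoms))
print(pairs, total)
```

In this example the single hydrogen working type can be paired with only one of the three hydrogen reference types, so the atoms typed H1 and HO in the reference contribute nothing to the matched weight.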

Given the resulting maximum weight graph matching, we use partial scores from each remaining edge and a total score for the graph to evaluate the set of working atom types. We define the partial score for each reference atom type as

$$S_i = \frac{N_{i,\mathrm{edge}}}{N_{i,\mathrm{mol}}} \tag{2.1}$$

where $N_{i,\mathrm{edge}}$ is the weight of the edge associated with reference atom type $i$ and $N_{i,\mathrm{mol}}$ is the number of atoms in the molecule set with reference atom type $i$. Thus, $S_i$ provides an assessment of how accurately the working atom type captures the chemistry of a specific reference atom type, with $S_i = 1$ indicating that every atom of reference type $i$ lies on the retained edge. The total score across types then is

$$S_{\mathrm{Total}} = \frac{\sum_{i=1}^{n} N_{i,\mathrm{edge}}}{N_T} \tag{2.2}$$

where $n$ is the number of reference atom types and $N_T = \sum_{i=1}^{n} N_{i,\mathrm{mol}}$ is the total number of atoms in the molecule set. Thus the MC acceptance criterion is given by

$$R < e^{(S_{\mathrm{proposed}} - S_{\mathrm{previous}})/T} \tag{2.3}$$

where $R$ is a random number from 0 to 1, $S_{\mathrm{proposed}}$ and $S_{\mathrm{previous}}$ are the graph match scores computed for the proposed and previous move, respectively, and $T$ is the temperature assigned by the user.
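Equations 2.1 to 2.3 translate directly into code. The sketch below uses hypothetical function names and takes the retained edge weights as a dictionary keyed by reference type. How the criterion is evaluated at T = 0, where the exponent is undefined, is an assumption of this sketch: score decreases are always rejected and ties are accepted.

```python
import math
import random


def partial_scores(matched_edges, atoms_per_reference_type):
    """Eq. 2.1: S_i = N_i,edge / N_i,mol for every reference type i."""
    scores = {}
    for ref_type, n_mol in atoms_per_reference_type.items():
        n_edge = matched_edges.get(ref_type, 0)  # unmatched types score 0
        scores[ref_type] = n_edge / n_mol
    return scores


def total_score(matched_edges, atoms_per_reference_type):
    """Eq. 2.2: sum of retained edge weights over the total atom count."""
    n_total = sum(atoms_per_reference_type.values())
    return sum(matched_edges.values()) / n_total


def accept_move(s_proposed, s_previous, temperature, rng=random):
    """Eq. 2.3: accept when R < exp((S_proposed - S_previous) / T)."""
    delta = s_proposed - s_previous
    if delta >= 0:
        # For T > 0 the exponential is >= 1 and R < 1, so this always passes.
        # Accepting ties at T = 0 as well is an assumption of this sketch.
        return True
    if temperature == 0:
        return False  # assumption: T = 0 rejects every score decrease
    return rng.random() < math.exp(delta / temperature)
```

For example, with retained edges `{"HC": 5, "CT": 4, "OH": 2}` and 15 atoms in total, the total score is 11/15, and any reference type without a retained edge receives a partial score of zero.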

The atom type scoring function just described pertains to the SMARTY method of sampling indirect chemical perception trees, which define only atom types. For the SMIRKY algorithm, which samples direct chemical perception trees, the atom type scores are supplemented with analogous scores for bond, angle and torsion types.

2.3.2 SMARTY is an algorithm for learning indirect chemical perception trees

The SMARTY algorithm defines its chemical perception tree using an ordered list of SMARTS strings, as described above, and manipulates these strings in order to vary and optimize this chemical perception tree using a selected scoring function. In this work, we use the graph-based scoring function defined in Section 2.3.1 to measure similarity to the atom types in a target force field. Ultimately, our initiative will use a scoring function that reflects the ability of a force field using the working chemical perception tree to replicate experimental data. In this subsection, we describe the algorithm, including user-specified inputs and move sets in SMARTS space.

User specified inputs allow for customization

The present SMARTY implementation takes five input files. The first is a file containing a set of SMARTS strings defining “base atom types” or base types, which will not be changed. These are typically the most generic type definitions. The second is a file containing “initial atom types,” which form a superset of the base types. Typically, the initial types include not only generic base types, but also some more specific types, with the generic types listed before the specific types to define a hierarchical chemical perception tree. Initial types provide an initial configuration for the simulation in the search space of types. The third is a file listing the SMARTS atom decorators which will be used during sampling (see examples in Table 2.1). These three files are provided in a format where the first entry in each line is a complete SMARTS string or SMARTS decorator, and the second item is an informal name for the string or decorator. For example, “[#7] nitrogen” could be an entry in the initial type file, and “X4 connectivity-4” could be an entry in the decorator file. The other two input files relate to our specific test in this study, where we seek to determine how well sampled chemical perception trees can capture the typing of an existing force field (Section 2.3.1). Thus, our fourth input file provides the set of molecules to be typed in the process of evaluating a chemical perception tree, and the fifth contains the same molecules labeled with the atom types associated with the target force field.
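As an illustration of the two-column format used by the first three files, the following sketch reads lines such as “[#7] nitrogen” into (pattern, name) pairs. The function name is hypothetical, and details such as comment handling in the real input files are not covered here.

```python
def parse_type_lines(text):
    """Return (pattern, informal name) pairs from two-column text."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        # The pattern is the first whitespace-delimited entry; the rest is the name.
        parts = line.split(None, 1)
        pattern = parts[0]
        name = parts[1] if len(parts) > 1 else ""
        entries.append((pattern, name))
    return entries


initial_types = parse_type_lines("[#7] nitrogen\n")
decorators = parse_type_lines("X4 connectivity-4\n")
```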

After processing the user provided input, there is a preparation step before sampling can begin. The first step is to assign the base and initial atom types to the molecules via SMARTS chemical matching to substructures and to remove those base and initial types that match no atoms, in order to simplify the sampling problem. The remaining initial atom types become the working set of atom types. After the completion of this preparation phase, our atom type sampling works via the MC algorithm detailed above (Section 2.3.1). We next detail the specific move set used in SMARTY.

Move set for Monte Carlo sampling over atom type chemical perception trees

As illustrated in Figure 2.5, an MC move proposal either removes a non-base atom type from the working set of atom types or creates a more specific child type from an existing atom type, A. For creation of a new child, A’, if the definition of A comprises only the primary atom, a decorator may be added to it; or an alpha substituent may be added. When a new substituent is added, the type of bond connecting it, single, double, triple, or aromatic, is also chosen. If A already has an alpha substituent, then A’ may be created by adding a decorator to A; by adding a second alpha substituent; or by adding a beta substituent to any non-hydrogen alpha substituent. Finally, if A already has an alpha and a beta substituent, then a new decorator may be added to A, or a new alpha or beta substituent may be added. It is possible that in the future, move types might be added to allow the addition of decorations to alpha and beta substituents, but those moves are not included in our current SMARTY tests.

Figure 2.5: Flowchart illustrating the decision tree used for SMARTY move proposals. The numbers attached to certain arrows correspond to the probability of following that arrow; all moves are made with equal probability in SMARTY. The diamonds represent decisions and the rectangles are processes. We start the SMARTY move proposals with the working atom types and decide randomly with equal probability if we want to remove or add an atom type to the set. If we decide to remove, we randomly choose a working type to be removed and re-score the new set. If we decide to add, we pick a working atom type from the set and check if it has an alpha substituent. If the answer is no, we either add an alpha substituent or a decorator. If the answer is yes, we first check if the working atom type describes a hydrogen, and if yes, we add a beta substituent, if not, we either add an alpha or a beta substituent. The new atom type SMARTS created is added to the end of the list of its corresponding parent. The end result (black circle on the bottom) is a move proposal which is then evaluated via scoring function (Section 2.3.1).

Before the chemical perception tree resulting from the move is scored, it is subject to several validity checks:

1. All atoms in the molecule set must be assigned an atom type from the proposed list.

2. A new child, A’, must match at least one atom in the molecule set.

3. A new child A’ must not be a duplicate of any other atom type.

4. The parent atom type A of a new child must still match at least one atom in the molecule set, unless it is a base type.

Note that criterion 2 eliminates any atom type specifications that violate valence restrictions, such as a carbon with more than four bonds. All of these criteria are applied when the molecules are typed with the full chemical perception tree, so criteria 2 and 4 will only hold if types A and A’ are both assigned to at least one atom. If these criteria are met, the proposed atom type set is valid and can be scored (Section 2.3.1).
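The four validity checks can be expressed as a single predicate applied after the molecules have been typed with the full proposed list. The sketch below is illustrative only: the function and argument names are hypothetical, and duplicates are detected by plain string comparison, which is a simplification since two different SMARTS strings can describe the same chemistry.

```python
def is_valid_proposal(type_names, assignments, child, parent, base_types):
    """Apply the four validity checks to a proposed list of atom types.

    type_names  -- ordered SMARTS strings of the proposed list (child included)
    assignments -- one entry per atom: the SMARTS it was assigned, or None
    """
    assigned = set(assignments)
    if None in assigned:               # 1. every atom must receive a type
        return False
    if child not in assigned:          # 2. the child must be assigned to an atom
        return False
    if type_names.count(child) > 1:    # 3. the child must not be a duplicate
        return False
    if parent not in assigned and parent not in base_types:
        return False                   # 4. a non-base parent must keep an atom
    return True
```

Because the checks use the final assignments, a parent whose atoms are all claimed by its new child fails check 4 unless it is a base type, matching the note above.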

The SMARTY implementation also offers the option of allowing proposed moves only for atom types involving a single element. For example, one might allow deletion and child creation for only carbon atom types. This elemental SMARTY sampler, SMARTYelem, avoids the combinatorial complexity that results when changes are allowed for all elements and thus can speed convergence to the global optimum.

2.3.3 SMIRKY is an algorithm for learning direct chemical perception trees

We recently argued that direct chemical perception [82] has major advantages over indirect chemical perception and illustrated the application of direct chemical perception via a prototype smirnoff99Frosst force field in the new SMIRNOFF format [104]. The SMARTY algorithm (Section 2.3.2) is not adequate to sample over direct chemical perception trees, because it samples only over atom types, and not the independent bond, angle, and torsion types used in direct chemical perception. Therefore, we have developed a second algorithm, SMIRKY, to sample over chemical perception trees corresponding to these fragment types. SMIRKY uses the same Metropolis MC method to sample over a chemical perception tree, and also uses an analogous scoring function (Section 2.3.1). The differences are the use of more general fragment types instead of atom types, and the resulting differences in the move set. SMIRKY also differs in that it uses unequal probabilities for move proposals. We made this choice based on the unequal distribution of different SMIRKS decorators in SMIRNOFF fragment types. These unequal probabilities come from two sources: odds specified by the user in the input files (Section 2.3.3) and odds specified in the SMIRKY move set (Section 2.3.3). Unlike SMARTY’s elemental approach, we do not currently have a mechanism for reducing the combinatorial problem associated with sampling SMIRKS patterns by limiting the considered chemical space. We now present a new tool developed specifically to manipulate SMIRKS strings and then detail the SMIRKY input files and move set.

ChemicalEnvironment objects are used to parse complex SMIRKS patterns

We created the ChemicalEnvironment Python object which divides a SMIRKS string into atoms connected by bonds, and stores the decorators associated with each (Figure 2.6). This object also uses the atom indexing in the SMIRKS string to allow extraction of any indexed atom, any atom alpha or beta to an indexed atom, or any associated bond. The user can then easily modify the ChemicalEnvironment by adding neighboring atoms, removing neighboring atoms, or changing the list of decorators for a given atom or bond. Each ChemicalEnvironment is loaded and output as a SMIRKS string, so this object integrates with any other tool that can generate and interpret SMIRKS.

The ChemicalEnvironment object categorizes decorators on atoms and bonds based on the type of Boolean operator used to combine them; that is, the decorators are separated into OR and AND types. For example, consider "[#6X3H2,#7X2H1;A+0:1]-[#1:2]" in Figure 2.6, which is a SMIRKS pattern for a bond type, and hence has two indexed atoms. For atoms, OR types are composed of a base atom type (typically an atomic number) and a list of decorators to be combined via a logical OR (“,”). In this example, atom 1 has two OR types (“#6X3H2” and “#7X2H1”) which are divided into OR bases (“#6” and “#7”) and their corresponding decorators. For atoms, AND types are decorators that will be combined via a logical AND (“;”). Our example atom 1 has two AND decorators (“A” and “+0”) which apply to both the carbon and nitrogen OR types. Bonds connecting atoms can also have OR and AND decorators. When no Boolean operator is used then the decorator is considered an OR type, so the bond in the present example, “-”, would be considered an OR.
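The partitioning into OR types and AND decorators can be illustrated with a short stand-alone parser. This is not the ChemicalEnvironment class itself: the function name and the regular expression are assumptions of this sketch, and only a small subset of decorators (atomic number, connectivity, hydrogen count, charge, aliphatic or aromatic flags) is recognized. Recursive patterns of the form "$(...)" are not handled.

```python
import re

# Assumption: only this small subset of SMIRKS atom decorators appears.
TOKEN = re.compile(r"#\d+|[XHRDxv]\d+|[+-]\d*|[Aa]|\*")


def parse_atom(smirks_atom):
    """Split one bracketed SMIRKS atom into OR types, AND decorators, index."""
    body = smirks_atom.strip("[]")
    body, _, index = body.partition(":")      # text after ':' is the atom index
    or_part, *and_parts = body.split(";")     # ';' separates AND decorators
    or_types = []
    for alternative in or_part.split(","):    # ',' separates OR types
        tokens = TOKEN.findall(alternative)
        or_types.append((tokens[0], tokens[1:]))  # (OR base, OR decorators)
    and_decorators = [t for part in and_parts for t in TOKEN.findall(part)]
    return or_types, and_decorators, int(index) if index else None


or_types, and_decorators, index = parse_atom("[#6X3H2,#7X2H1;A+0:1]")
# or_types       -> [("#6", ["X3", "H2"]), ("#7", ["X2", "H1"])]
# and_decorators -> ["A", "+0"]
# index          -> 1
```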

SMIRKY inputs are like those for SMARTY but add user-selected move probabilities

The present SMIRKY implementation takes ten input files, which allow users more customized control of probabilities while sampling. Three of these files are similar to SMARTY inputs: a file with initial fragment types (that is, SMIRKS strings) defining a hierarchical chemical perception tree; a file containing molecules to be typed in the course of scoring the direct chemical perception tree; and a SMIRNOFF file which provides the reference fragment types for use in scoring. As in SMARTY, any of the initial fragment types that do not match any fragments are removed, creating a working set of fragment types for the MC algorithm. The next five files provide the atom, bond, and decorator types: OR atom types; OR and AND decorators for atoms; OR bond types; and AND decorators for bonds (Section 2.3.3). Each decorator file includes a column for the odds of proposing a given decorator in the course of the MC sampling. The final category of files allows the user to specify the odds of picking a certain atom or bond in a fragment that will be changed during the move. There are two files in this category, one for atoms and one for bonds. Each has two columns: the first specifies whether the move is to an indexed, alpha, or beta atom or bond, using the SMIRKS index number or the key words “alpha” or “beta”; the second sets the odds. For example, in a torsion, a user might want to make it more likely to make changes to the outside atoms (1 and 4) than to the inside atoms (2 and 3).

Figure 2.6: Mapping between ChemicalEnvironment and corresponding SMIRKS pattern. This illustrates a SMIRKS pattern “[#6X3H2,#7X2H1;A+0:1]-[#1:2]” (middle) for a substructure (top) being partitioned and how it would be stored in a ChemicalEnvironment object (bottom). The SMIRKS pattern (middle) represents a bond parameter “-” (gray) between two labeled atoms: a carbon (light blue) or nitrogen (light blue) atom and a hydrogen atom (light red). In this figure, we show two matches for this SMIRKS pattern on the top structure. The matches are highlighted in light blue (atom 1) and light red (atom 2) circles on the substructure (top). In the ChemicalEnvironment representation, at bottom, atom 1 (light blue rectangle) has two OR types: “#6X3H2”, with a carbon atom base “#6” and OR decorators “[‘X3’, ‘H2’]”, corresponding to a connectivity of three and a total hydrogen count of two, and “#7X2H1”, with nitrogen atom base “#7” and OR decorators “[‘X2’, ‘H1’]” (light green). The AND decorators for atom 1 are “[‘A’, ‘+0’]” (light green), corresponding to an aliphatic atom with a zero charge. In SMIRKS strings, the low-precedence “and” operator is given by a semicolon “;”, so the bond described by this pattern is one between a carbon atom with connectivity three and hydrogen count two or a nitrogen atom with connectivity two and hydrogen count one that is also aliphatic with zero charge.
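To make the role of the odds columns concrete, the sketch below reads a two-column odds listing and draws a target with probability proportional to its odds. The function names and the example odds are hypothetical; the exact layout of the real SMIRKY odds files may differ.

```python
import random


def parse_odds(text):
    """Read 'target odds' rows, for example '1 10' or 'alpha 2'."""
    odds = {}
    for line in text.splitlines():
        if line.strip():
            target, weight = line.split()
            odds[target] = float(weight)
    return odds


def pick_target(odds, rng=random):
    """Draw one target with probability proportional to its odds."""
    targets = list(odds)
    weights = [odds[t] for t in targets]
    return rng.choices(targets, weights=weights, k=1)[0]


# Made-up odds for a torsion: outer atoms (1 and 4) are favored 10 to 1.
torsion_odds = parse_odds("1 10\n2 1\n3 1\n4 10\nalpha 0\n")
chosen = pick_target(torsion_odds)
```

Odds are relative weights rather than probabilities, so they do not need to sum to one, and a target with odds of zero is never selected.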

Move set for Monte Carlo sampling over fragment type chemical perception trees

SMIRKY uses the same general move set as SMARTY; that is, a fragment type A is chosen and then either deleted or used as the starting point for a child fragment type A’ (Figure 2.7). The input odds files for SMIRKY allow the user to customize sampling for a specific fragment type, but the probabilities determining which type of move, for example adding or removing a non-indexed atom, are set internally. When creating a child fragment type, the first choice made is whether to change an atom or a bond. If an atom is chosen, then decorators on the atom can be added, swapped, or removed; a new connected atom can be added as a neighbor; or if the atom selected is not an indexed atom it can be removed (indexed atoms must be retained in order for the fragment to be fully defined). An atom can be removed from a fragment definition only if it is connected to just one other atom; the associated bond is removed at the same time. If a bond is chosen instead of an atom, either type of decorator can be added, swapped, or removed. As with SMARTY, we only consider substructures that extend to the beta position of any indexed atom. Because there are more SMIRKS decorators available for atoms than bonds, the move set is weighted to choose atom moves more often than bond moves.

To allow efficient construction of complex SMIRKS patterns that make chemical sense, symmetric moves are also sometimes proposed. In symmetric moves, equivalent modifications are proposed simultaneously to both outer atoms in a bond, angle, or torsion. The probability of making a symmetric move is fixed inside SMIRKY. Specifically, the frequency of symmetric bond, angle, and torsion types in smirnoff99Frosst was used to assign the probability of making symmetric moves for each fragment type by category.

Figure 2.7: Flowchart to illustrate the decision tree used for SMIRKY move proposals. The numbers attached to certain arrows correspond to the probability of following that arrow, as currently implemented in SMIRKY. The diamonds represent decisions and the rectangles are processes. We start the SMIRKY move proposals by choosing an initial fragment type SMIRKS from the set. SMIRKY then decides randomly with specified probabilities whether it will remove or add a fragment type SMIRKS. To add, the algorithm chooses between making a change to an atom or a bond. If the change is in a bond, SMIRKY picks OR or AND decorators to change, but if the change is in an atom, SMIRKY can pick OR decorator or OR base to change; and then choose between deleting, adding or swapping one of those decorators. The last step is to choose if a change to an outer atom should be symmetric; if so, both outer atoms are changed. Also, if SMIRKY chooses to change an atom, it can delete or add a new atom to this atom. The new fragment type is added to the end of its parent’s list. The end result (black circle) is a move proposal which is then evaluated via the scoring algorithm described in Section 2.3.1. * - The probability of “yes” for this particular decision is 0.2, 0.6, and 0.15 for bond, angle, and torsion types, respectively.

2.3.4 SMARTY and SMIRKY were evaluated using multiple molecule sets compared to reference force fields

We sought to determine whether our machinery could discover SMARTS or SMIRKS patterns that replicate the chemical complexity of types in an existing force field, thus providing a proof of principle for automating a step in force field development that has in the past been done only by hand. In order to measure chemical complexity we defined success based on the total and partial scores for the sample types compared to reference types (Section 2.3.1). A total score of 100% corresponds to patterns matching all reference atom or fragment types and would indicate complete success for a set of working atom or fragment types. However, it is not necessary that we exactly reproduce the atom typing or fragment typing of existing force fields to succeed, especially in view of the combinatorial challenge we face; it is sufficient that we sample comparable chemical complexity. Thus, achieving a high total score on average would also be acceptable, especially if we can achieve at the same time a partial score of 100% for most atom or fragment types. That is, for both SMARTY and SMIRKY, we seek to recover SMARTS or SMIRKS which will mirror the previously defined indirect or direct chemical perception from the reference force field, by separating atoms or fragments into similar types. However, the automatically generated trees do not necessarily need to encode the exact same chemical perception trees as those from the reference force field.

To understand the evaluation of results, we consider the concrete example of sampling atom types for 1,3-dioxepane-2,4-diol, which has five different hydrogen atom types when typed with parm99/parm@Frosst:

• hydrogen bonded to oxygen (HO),

• hydrogen bonded to an aliphatic carbon (HC),

• hydrogen bonded to an aliphatic carbon with one electron withdrawing substituent (H1),

• hydrogen bonded to an aliphatic carbon with two electron withdrawing substituents (H2), and

• hydrogen bonded to an aliphatic carbon with three electron withdrawing substituents (H3).

If SMARTY is successful, it will discover SMARTS strings that identify these types (Figure 2.2), starting from a chemically undistinguished initial hydrogen base type. Finding all five reference types simultaneously would give a 100% total score, but we would also consider it a success to discover SMARTS patterns that receive a 100% partial score for all of these types during different parts of a simulation. Partial success might come from distinguishing four of the five types instead.

The chemical perception trees sampled in both SMARTY and SMIRKY were scored against the typing in existing force fields (Section 2.3.1). For SMARTY, we used a traditional indirect chemical perception force field specification as the reference, namely, AMBER’s parm99 [105] with the parm@Frosst [106] extension. For SMIRKY, we used the direct chemical perception force field smirnoff99Frosst [104], which was generated by hand to closely follow the parameters in the parm99/parm@Frosst force field.

We developed three molecule sets to test SMARTY and SMIRKY. All three are included as images in the Supporting Information and molecule files in the Additional Information.

For the first set, our goal was to have a small set of molecules with minimal complexity to test whether these tools could work as intended. Thus, this set was limited to molecules composed of carbon, oxygen, and hydrogen atoms, with only single bonds; that is, to alkanes, ethers, and alcohols. The compounds were drawn from the AlkEthOH compound set [82]. The specific set of compounds used here comprises 42 molecules, with a total of 803 atoms. We refer to this minimal set as AlkEthOH in our results.

The second set, PhEthOH, is the first set additionally supplemented with aromatic compounds, and comprises 200 molecules with a total of 7,185 atoms. Chemically, the difference in these sets is that PhEthOH includes aromatic rings, along with alkanes, ethers, and alcohols. AlkEthOH has one atom type not found in PhEthOH, namely H3, corresponding to a hydrogen bound to a carbon which is connected to three electron withdrawing groups. The aromatic groups in PhEthOH introduce two atom types not found in AlkEthOH: specifically, CA for aromatic carbons and HA for the hydrogens connected to those carbons. However, PhEthOH has more bond, angle, and fragment types than AlkEthOH.

The third set of molecules was obtained by filtering DrugBank, a free database of drug and drug-like molecules [107, 108, 109, 110]. Molecules with fewer than three or more than 100 heavy atoms were removed, as were any molecules with metals or metalloids. Next, the molecules were assigned parm99/parm@Frosst atom types, and any molecule that could not be typed was removed. The molecules were then also typed with smirnoff99Frosst version 1.0.5 and any molecules assigned a generic parameter, for example any bond (“[*:1]∼[*:2]”), were also removed. Finally, we selected a reduced set of compounds that include all atom and fragment types in the reference force fields used in DrugBank after this filtering. The final set, termed MiniDrugBank, has 371 molecules containing 15,678 atoms, and is available separately on GitHub (https://github.com/openforcefield/MiniDrugBank).

Total and partial scores were used to evaluate SMARTY and SMIRKY results

A series of tests were performed to evaluate success for each tool described above — the original SMARTY sampler (SMARTYorig), the elemental SMARTY sampler (SMARTYelem), and SMIRKY — in discovering chemical perception of the complexity in existing force fields. Each test was run on each of the three molecule sets described above, AlkEthOH, PhEthOH, and MiniDrugBank. For SMARTYelem, tests were performed for each element with more than one atom type in the molecule set. Using the elemental sampler on elements with only a single atom type (such as halogens) would not provide useful insight since even a generic SMARTS pattern could match the relevant chemical perception. SMIRKY tests were performed for bond, angle, proper torsions, and nonbonded (vdW) fragment types. While the SMIRNOFF format and SMIRKY support improper torsions, there are a very limited number in smirnoff99Frosst so those were not tested.

MC sampling was run at eight effective temperatures: 0, $10^{-6}$, $10^{-5}$, 0.0001, 0.001, 0.01, 0.1, and 1. Ten initial tests of 10,000 iterations were performed for each combination of molecule set, temperature, and tool. These were followed by three additional tests with first 50,000 and then 100,000 iterations for MiniDrugBank with SMIRKY, since the search space for fragment types is so large. Output for both tools was saved in two ways: a human-readable log of the run, and a comma-separated trajectory file which stores the partial score for each reference atom or fragment type and the total score at each iteration.

Output files and plots of score versus time for all simulations along with analysis scripts used for the results shown here can be found in the Additional Information. Source code and usage examples for SMARTY and SMIRKY can also be found on our GitHub repository (https://github.com/open-forcefield-group/smarty).

2.4 Results and Discussion

A central goal of this study is to test the ability of the SMARTY and SMIRKY sampling methods to discover type definitions closely matching those in existing force fields. As detailed above, if successful, we would recover SMARTS and SMIRKS patterns matching all atom or fragment types, by achieving total scores of 100% or partial scores of 100% for all reference types (Section 2.3.4). However, we would consider high total scores along with a 100% partial score for most reference types to also indicate success, showing promise for expanding these algorithms for sampling chemical perception trees beyond the proof of principle in this paper. Both methods were tested on three molecule sets of increasing complexity: AlkEthOH, PhEthOH, and MiniDrugBank (Section 2.3.4).

2.4.1 SMARTY

We performed 10 SMARTY runs, of 10,000 steps each, for each of the three molecule sets, at effective temperatures of 0, $10^{-6}$, $10^{-5}$, 0.0001, 0.001, 0.01, 0.1, and 1, using both the original SMARTY sampler (SMARTYorig) and the elemental sampler (SMARTYelem). The results of these calculations are analyzed in the following two subsections.

High scores are achieved for the AlkEthOH and PhEthOH molecule sets

We first tested SMARTY on the AlkEthOH and PhEthOH sets. Although these have only eight atom types, some of the five hydrogen types used in these sets require multiple atoms in the beta position, so they still provide a significant challenge. For the AlkEthOH set run at effective temperatures between $10^{-6}$ and $10^{-2}$, the original sampler, SMARTYorig, yielded total scores, averaged across the 10 runs, of > 90% (Table 2.2), and the lowest average score is still fairly high, at 69%. The degradation of performance at lower and higher temperatures is reasonable for the MC method, as low temperatures can lead to trapping in local maxima, while high temperatures can lead to a lower preference for favorable states. Figure 2.8 provides sample plots of total score against step number for MC runs at various temperatures. As expected, the higher temperatures lead to greater score fluctuations. Equivalent results were obtained for the PhEthOH set (Table 2.2), which has a similar number of atom types (Section 2.3.4). To summarize results, we also report total scores for the initial atom types used in each simulation and the maximum total score received across all simulations (Table 2.3). Overall, these results strongly support the ability of SMARTY to sample chemically relevant ensembles of atom types.

Average Score (%)*
Temperature    AlkEthOH (%)*    PhEthOH (%)*    MiniDrugBank (%)*
1.0            68.9 ± 0.1       56.9 ± 0.1      42.98 ± 0.10
0.1            75.3 ± 0.1       62.2 ± 0.2      45.71 ± 0.19
0.01           94.4 ± 1.1       88.1 ± 1.4      67 ± 1
0.001          97.5 ± 0.8       96 ± 1          77.0 ± 1.1
0.0001         97.9 ± 0.9       95 ± 1          77.9 ± 1.6
1e-05          98.5 ± 0.7       94.6 ± 2.3      70.0 ± 1.5
1e-06          97.4 ± 0.8       95 ± 1          73.6 ± 1.7
0              88.6 ± 2.6       73 ± 4          72.2 ± 1.5

Table 2.2: Average of average total scores of 10 SMARTY simulations with AlkEthOH, PhEthOH, and MiniDrugBank. We took the average score at each temperature for each of our 10 simulations (10,000 iterations each), then averaged this across all 10 simulations. * - uncertainties were estimated by the standard error over all 10 trials.
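The averaging behind each table entry can be sketched as follows. The function name is hypothetical, and the standard error is assumed here to be the sample standard deviation of the per-run means divided by the square root of the number of runs; the analysis scripts in the Additional Information are the authoritative version.

```python
from statistics import mean, stdev


def average_of_averages(runs):
    """Mean over runs of each run's mean total score, with its standard error.

    runs -- one sequence of per-iteration total scores for each simulation
    """
    run_means = [mean(scores) for scores in runs]
    # Assumed estimator: sample standard deviation over sqrt(number of runs).
    standard_error = stdev(run_means) / len(run_means) ** 0.5
    return mean(run_means), standard_error
```

Each inner sequence would hold the total score column of one trajectory file, so a table cell corresponds to one call with the 10 runs at a given temperature.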

Analysis of the partial scores offers a finer-grained view of these overall results. Thus, we made heat maps showing the frequency of generating a partial score of 100% for each reference atom type (Figure 2.9), at the various simulation temperatures. These show that some reference atom types are easier to discover than others. This can be for multiple reasons, including because they have simpler specifications, and because they occur more frequently. Also, although low to moderate temperatures yield the best results for most atom types, higher temperatures worked better for atom type HC, at least for the PhEthOH molecule set.

[Figure 2.8 plot: total score (y axis) versus iterations (x axis), with one line per temperature.]

Figure 2.8: Total score as a function of iteration for SMARTYorig sampling AlkEthOH. The graph shows how the total score (fraction of atoms matching the parm99/parm@Frosst reference atom types) evolves with iteration at different temperatures for SMARTYorig sampling AlkEthOH. The lines in the graph represent the temperatures 0 (pink), 0.0001 (orange), 0.01 (green) and 0.1 (gray). We can see a high variation throughout the simulation and a predominance of lower scores for extreme temperatures such as 0 and 1.0. At intermediate temperatures, such as 0.01 and 0.0001, SMARTY scored higher. The Supporting Information includes plots for simulations at all temperatures.

             AlkEthOH (%)         PhEthOH (%)          MiniDrugBank (%)
Element      Initial   Maximum    Initial   Maximum    Initial   Maximum
All          67.8      100.0      54.5      100.0      40.5      93.0
Hydrogen     52.6      100        39.0      100        35.9      97.0
Carbon       100       100        71.0      100        39.0      95.7
Oxygen       63.6      100        84.1      100        38.3      98.0
Nitrogen     n/a       n/a        n/a       n/a        33.1      84.0
Sulfur       n/a       n/a        n/a       n/a        52.2      100

Table 2.3: Initial and maximum total scores for all SMARTY simulations with AlkEthOH, PhEthOH, and MiniDrugBank. Initial scores are the score from the initial types in each molecule set. Maximum scores are the highest total score measured during a single iteration in any simulation performed. SMARTYorig scores are shown in the ‘all’ row, and SMARTYelem scores are shown broken out in the ‘hydrogen’, ‘carbon’, ‘oxygen’, ‘nitrogen’, and ‘sulfur’ rows.

Figure 2.10 shows a sample working atom type hierarchy tree (top panel) that was generated during a SMARTY simulation, with levels of the hierarchy represented by different colors. The bottom panel shows which reference atom types are matched by each SMARTS pattern, and the partial score for each match. It is worth noting in this regard that SMARTY’s hierarchical approach to sampling dramatically simplifies the complex search problem. For example, in the parm99/parm@Frosst force field for AlkEthOH, there are four atom types for hydrogens bound to carbon, differing in the number of electron withdrawing groups bound to the carbon atom (i.e., in the beta position). As mentioned before, these hydrogens provide a significant challenge to SMARTY. With hierarchical sampling, SMARTY can first discover less specialized SMARTS and then proceed to more complex derivative types further down the tree (Figure 2.10). SMARTY would not be able to find H1 (“[#1$(*-[#6]-[#8])]”), a hydrogen bound to a carbon with one electron withdrawing group, without first discovering HC (“[#1$(*-[#6])]”), because moves are made one atom at a time. This also highlights the importance of the hierarchy when matching; if the more complex child SMARTS patterns were placed above their parents, then all hydrogen atoms would be assigned to the more general “[#1]” type; the more complex pattern is only valuable when it is a child of and placed below the less complex pattern.
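The effect of ordering in the hierarchy can be demonstrated with a toy typing loop in which the last matching entry wins. Everything here is a stand-in: the atoms are plain dictionaries and each SMARTS string is paired with a hand-written predicate instead of real substructure matching, so the sketch shows only the ordering logic.

```python
def assign_types(atoms, ordered_types):
    """Give each atom the last type in the list whose test it passes."""
    assigned = []
    for atom in atoms:
        label = None
        for name, matches in ordered_types:
            if matches(atom):
                label = name  # later, more specific entries override earlier ones
        assigned.append(label)
    return assigned


def is_hydrogen(atom):
    return atom["element"] == "H"


def is_h_on_carbon(atom):
    return is_hydrogen(atom) and atom["on_carbon"]


def is_h_on_carbon_with_oxygen(atom):
    return is_h_on_carbon(atom) and atom["oxygens"] >= 1


# Toy atoms: element, whether bound to carbon, and oxygens on that carbon.
atoms = [
    {"element": "H", "on_carbon": True, "oxygens": 0},
    {"element": "H", "on_carbon": True, "oxygens": 1},
    {"element": "H", "on_carbon": False, "oxygens": 0},
]
generic = ("[#1]", is_hydrogen)
hc = ("[#1$(*-[#6])]", is_h_on_carbon)
h1 = ("[#1$(*-[#6]-[#8])]", is_h_on_carbon_with_oxygen)

parents_first = assign_types(atoms, [generic, hc, h1])
children_first = assign_types(atoms, [h1, hc, generic])
```

With parents listed first, the three atoms receive three different types; with children listed first, every hydrogen ends up with the generic “[#1]” type, as described above.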


Figure 2.9: Frequency of success for SMARTYorig on the AlkEthOH and PhEthOH sets. Heat maps show the results of SMARTYorig for the AlkEthOH (A) and PhEthOH (B) molecule sets. We plot the fraction of simulation time that each reference atom type (x axis) is matched by a SMARTS pattern with a partial score of 100%. We tested eight different temperatures (y axis) and ran 10 simulations of 10,000 iterations each. Darker colors imply a higher rate of matching 100% of that reference atom type; see color bars at right. If we never found the reference atom type, the corresponding cell would be white, but this does not occur here.

This point is further illustrated by considering the additional hydrogen types used in AlkEthOH. The hydrogen atom types used by parm99/parm@Frosst are relatively complex because previous work concluded the nonbonded parameters for these atoms need to be different [111], and thus SMARTS patterns to match these must also be fairly complex. Specifically, these atom types require SMARTS strings extending out far enough to describe multiple atoms in the beta position. If SMARTY did not use SMARTS patterns already generated as parent types, finding these complex patterns would be impossible. Here, the relevant atom types cover hydrogen attached to carbon with zero (HC), one (H1), two (H2) or three (H3) electron withdrawing groups (Figure 2.2). For example, a SMARTS pattern which can recognize H3 (in AlkEthOH as shown in Figure 2.2 in red) is “[#1$(*-[#6](-[#8])(-[#8])-[#8])]”. But arriving at H3 directly from HC (“[#1$(*-[#6])]”) would be impossible without very complex move proposals (yellow to red in Figure 2.2). Instead, a move in SMARTY could use HC as a parent type and propose a move to create H1. A subsequent move could take H1

Figure 2.10: Example of a SMARTY hierarchy and inheritance in AlkEthOH. This figure shows the final hierarchy achieved in one SMARTY simulation with the AlkEthOH set at T = 0.0001 after 10,000 iterations. The top shows the hierarchy of SMARTS strings discovered where each color represents a different level. For example, the initial types are represented in light blue, and the yellow squares are the working atom types derived from the initial types; the same logic applies for the other colors. For hydrogen, we see SMARTS patterns that match the parm99/parm@Frosst reference types HC (yellow), H1 (light green), H2 (lilac), and H3 (light red). The bottom (large gray box) shows all final working types and their respective reference atom types as well as partial score for that pairing.

as a parent and generate H2 as a child, and then the same for creating H3 from H2. Of course, this will only work if uncovering each new intermediate type provides some incremental increase in total score, which it does here. Our assumption is that the chemistry of more diverse sets can be described in a similar way, starting with the simplest (or most generic) SMARTS and then extending that pattern only where necessary.

The analysis so far has focused on the original SMARTY algorithm, which samples over all elements simultaneously. This approach is comprehensive, but leads to a combinatorial sampling problem. Imagine we want to find the hydrogen atom types in AlkEthOH. The most complex of these is H3, which requires specifying one atom in the alpha position and three atoms in the beta position (Figure 2.10). When adding either an alpha or beta substituent, the probability of choosing a specific base type (i.e. correct element) is approximately 0.15. This means it could potentially take SMARTY more than $0.15^{-4} \approx 2{,}000$ moves to find H3 when starting with a hydrogen parent type. Including the necessity of choosing the correct parent element decreases the probability of each move to 0.015, meaning it could take SMARTY on the order of 10,000,000 moves to generate the correct SMARTS pattern for H3. The elemental SMARTY method, SMARTYelem, speeds the search by sampling only one element (base type) at a time. Thus, fewer iterations are required to first discover a SMARTS with a total score of 100%, as shown for AlkEthOH in Figure 2.11.
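The order-of-magnitude estimate above can be reproduced with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the function name is illustrative, and the parent-selection probability of 0.1 is an assumption inferred from the ratio of the two per-move probabilities quoted in the text (0.015 / 0.15).

```python
def moves_needed(p_per_move, n_moves):
    """Rough number of proposals needed when n_moves specific moves,
    each proposed with probability p_per_move, must all be made."""
    return p_per_move ** -n_moves

p_base_type = 0.15   # chance of proposing the correct element (from the text)
p_parent = 0.1       # assumed: implied by 0.015 / 0.15
n_substituents = 4   # one alpha atom and three beta atoms for H3

without_parent = moves_needed(p_base_type, n_substituents)
with_parent = moves_needed(p_base_type * p_parent, n_substituents)

print(f"{without_parent:.0f} moves ignoring parent selection")
print(f"{with_parent:.2e} moves including parent selection")
```

Both results land in the ranges quoted in the text (thousands of moves, and tens of millions of moves once parent selection is included).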

MiniDrugBank provides a more difficult molecule set for testing SMARTY

The MiniDrugBank molecule set was drawn from the DrugBank database, and contains a similar diversity of atom types in a smaller number of molecules (Section 2.3.4). Initial and maximum scores across all SMARTY simulations for MiniDrugBank are shown in Table 2.3. The total number of reference atom types represented in MiniDrugBank, 37, is considerably larger than the number of reference atom types in AlkEthOH or PhEthOH, making this a more challenging test case. As a consequence, the average total scores run lower, often at

Figure 2.11: Comparison of SMARTYelem (hydrogen and oxygen) versus SMARTYorig (all atoms) on AlkEthOH. Here we determine the first iteration at which each sampler achieves a 100% total score (numerical values, out of 10,000 in total), then average this number for all runs which were successful in achieving a score of 100% at some point (out of 10 trials in total). The y axis shows which sampler was used (SMARTYorig for all atoms versus SMARTYelem for hydrogen and oxygen) and the x axis shows the temperature employed. In general, lighter colors indicate higher success rates, and lower numbers indicate success was achieved more rapidly. Results for SMARTYelem for carbon are not shown since AlkEthOH has only one carbon atom type (CT).

70-80%, than in the case of the simpler test sets (Table 2.2). We therefore focus further evaluation on the elemental SMARTY sampler (SMARTYelem), which reduces the combinatorial explosion of types and thus is particularly suitable for bigger and more complex molecule sets. Accordingly, the evaluation of these results is in terms of partial scores, rather than total scores (Section 2.3.4).

We find that SMARTYelem is capable of finding SMARTS patterns for the majority of reference atom types (Figure 2.12). Overall, SMARTY is more likely to match 100% of all reference atom types at relatively high temperatures. When we decrease the temperature, SMARTY’s behavior changes and no longer finds many reference atom types (shown by white in the heat maps), but it matches 100% of some atom types more frequently (darker colors), until the temperature becomes too low.

SMARTY is able to generate multiple different SMARTS patterns matching typical reference atom types in MiniDrugBank. In 10 runs of 10,000 iterations each, we found 16,564 unique SMARTS strings which match at least some fraction of any reference atom type. When we look at the SMARTS patterns which match 100% of any reference atom type, the number is still rather large, at 801 unique SMARTS strings. This was true even for simple atom


Figure 2.12: Frequency of success for SMARTYelem on the MiniDrugBank set. For SMARTYelem for MiniDrugBank, we plot the fraction of simulation time where the partial score is 100% for a given parm99/parm@Frosst atom type. The elements shown here are (a) hydrogen, (b) carbon, (c) nitrogen, and (d) oxygen. We used 10 simulations of 10,000 iterations each to construct the plots at different temperatures (y axis). At high temperatures, most of the reference atom types reach 100% partial score. However, at low temperatures, there are more well populated reference atom types but also more which never achieve 100% at any iteration.

types, such as CM, which is represented by “[#6X3]” in parm99/parm@Frosst. For CM, SMARTY found 24 different SMARTS strings that matched 100% of this reference atom type. Some of these 24 SMARTS are very generic, such as “[#6]”, others more complex (“[#6X3$(*-[*])$(*=[#6])]”). Our goal was not to replicate exactly the same SMARTS patterns written by human experts, but instead to determine to what extent automated sampling can capture equivalent, or at least similar, chemical perception. SMARTY’s ability to discover a wide diversity of SMARTS patterns corresponding to each reference atom type appears to demonstrate success in this regard.

When working types have only very generic SMARTS patterns, they will match the most populated reference atom type due to our typing scheme and scoring function. The scoring function uses a bipartite graph where the matching is based on the maximum total weight and the weight on each edge is determined by the number of atoms assigned that working and reference type (Section 2.3.1). For this reason, a reference type with the highest population is more likely to be matched with a generic working type. To understand the implications of this, consider the frequency of two of the carbon types in MiniDrugBank, CT and CA. The parm99/parm@Frosst atom type CT is assigned to 2213 atoms, while CA is used for 2097. If we have only one carbon SMARTS string, such as the generic one “[#6]”, all carbon atoms in the molecule set will be assigned as “[#6]”, and after scoring, the SMARTS pattern will match the reference atom type CT because it leads to the highest total score. In that case, CA (and all other carbon atom types) are not matched to any SMARTS. But, if we discover a new working atom type that matches CT only, such as “[#6X4]”, then the generic SMARTS will match CA (Figure 2.13).
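The CT/CA example can be worked through with a small maximum-weight matching sketch. It makes several simplifying assumptions: only the two carbon reference types quoted above are considered (MiniDrugBank contains others), the matching is found by brute force rather than with a graph library, and `best_matching` is an illustrative helper, not the SMARTY implementation.

```python
from itertools import permutations

def best_matching(weights, working, reference):
    """Brute-force maximum-weight one-to-one matching (fine for tiny examples).

    weights[(w, r)] is the number of atoms assigned working type w and
    reference type r. Assumes len(working) <= len(reference).
    """
    if len(working) > len(reference):
        raise ValueError("toy example assumes no more working than reference types")
    best_score, best_pairs = -1, ()
    for refs in permutations(reference, len(working)):
        pairs = tuple(zip(working, refs))
        score = sum(weights.get(pair, 0) for pair in pairs)
        if score > best_score:
            best_score, best_pairs = score, pairs
    return best_score, best_pairs

total_carbons = 2213 + 2097  # CT and CA atom counts quoted in the text

# (a) One generic working type: every carbon atom is typed "[#6]".
weights_a = {("[#6]", "CT"): 2213, ("[#6]", "CA"): 2097}
score_a, pairs_a = best_matching(weights_a, ["[#6]"], ["CT", "CA"])

# (b) "[#6X4]" sits below "[#6]" in the hierarchy and claims the CT atoms.
weights_b = {("[#6X4]", "CT"): 2213, ("[#6]", "CA"): 2097}
score_b, pairs_b = best_matching(weights_b, ["[#6]", "[#6X4]"], ["CT", "CA"])

print(pairs_a, round(100 * score_a / total_carbons, 1))
print(pairs_b, round(100 * score_b / total_carbons, 1))
```

In case (a) the generic pattern pairs with CT and CA goes unmatched; in case (b) the generic pattern is paired with CA instead, as in Figure 2.13.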

For MiniDrugBank we were only partially successful; we were not able to find a 100% partial score for the HX, N2, C*, and HP atom types with SMARTY. It is worth briefly examining these types to understand why. HX was created to address a very specific problem occurring with hydroxyl hydrogens; to match it, a SMIRKS pattern would need to describe an atom in

Figure 2.13: SMARTY scoring with the bipartite graph. We illustrate how SMARTY scores working atom types (red on the left) compared to reference atom types (blue on the right) using a bipartite graph. (a) When there is only one generic SMARTS string (“[#6]”), this pattern is assigned to both CT and CA atom types through the PATTY-adapted scheme, but it will be matched only to the more populated reference atom type (CT highlighted in yellow); (b) when SMARTY creates a new set of working types, containing both “[#6]” and “[#6X4]”, the generic atom type will not match CT anymore, since it chemically matches the new working atom type (“[#6X4]”) instead.

the gamma position (3 bonds away from the primary), but SMARTY sampling extends out only to the beta position. N2 requires multiple SMARTS patterns to be able to match 100% (it combines several different chemical groups into a single atom type) and the bipartite graph matching only allows one working atom type per reference atom type, meaning that SMARTY will never discover it. C* is very similar to other five-membered ring carbon atom types (CW, CC, and CR) in our tests; it also occurs less frequently, so SMARTY would need to generate and keep highly specialized SMARTS patterns that match the other atom types uniquely before it could be discovered. In order to find HP, the SMARTS pattern would need a decorator in the beta position, but SMARTY does not currently add decorators to the beta position, making it impossible to match this particular atom type. More details about all of these atom types and why they never get a partial score of 100% can be found in the Supporting Information.

Nevertheless, SMARTY was fairly successful finding a 100% partial score for most reference

atom types in parm99/parm@Frosst with only a few exceptions. In our tests, we were able to find patterns matching 33 out of 37 parm99/parm@Frosst atom types with a 100% partial score, an overall success rate of 89.2%. While some reference atom types are not found, this seems to be less a commentary on our chosen sampling algorithm and move set than a byproduct of the complex human decisions made in constructing parm99/parm@Frosst. The very complex atom typing employed in parm99/parm@Frosst may provide a further argument in favor of moving toward a force field with direct chemical perception and away from those with human defined atom types.

2.4.2 SMIRKY

SMIRKY extends SMARTY, testing our ability to sample over chemical perception trees used for fragment types in SMIRNOFF format force fields. It automatically samples SMIRKS patterns for bond, angle, torsion, and nonbonded fragment types, then compares the resulting fragment typing with that of a reference force field for the same set of molecules. As with SMARTY, the overall goal with SMIRKY is to ensure we can find SMIRKS patterns that sample comparable chemical complexity to that used in our reference force field, smirnoff99Frosst. Scoring works the same as with SMARTY: we evaluate both the total score and the partial score for each iteration. In the case of SMIRKY, a partial score of 100% corresponds to SMIRKY discovering a SMIRKS pattern that matches a single reference fragment type (Equation 2.1).

With SMIRKY, we were at least partially successful with all molecule sets as evidenced by our ability to find high total scores and 100% partial scores for the majority of fragment types in smirnoff99Frosst. Initially, we performed simulations of comparable length with SMIRKY as with SMARTY; however, increased complexity made longer simulations necessary. Specifically, we began by performing 10 simulations with 10,000 iterations for each

fragment type at all eight temperatures (0, $10^{-6}$, $10^{-5}$, 0.0001, 0.001, 0.01, 0.1, 1). However, because SMIRKY goes beyond atom types to fragment types, it necessarily must sample more complex chemical perception trees. For example, torsion fragment types require at least four atoms and potentially substituents as well; decorators may also be required for all these atoms. This results in SMIRKS patterns which involve four or more atoms (three or more bonds), all with decorators, which is a challenging problem. In SMARTY, the combinatorial problem of discovering sufficiently complex SMARTS patterns was tackled by creating the elemental sampler in order to limit the chemical space being searched. But with SMIRKY, we cannot currently sub-divide the chemical space for these more complex fragments (Section 2.3.3), so we instead increased the number of iterations. For the most complex molecule set, MiniDrugBank, we increased the number of iterations to 50,000 and then 100,000. In each case, we performed three simulations at each temperature for each fragment type. To summarize our results, we report total scores for the initial fragment types used in each simulation and the maximum total score received across all simulations (Table 2.4). We also report the fraction of reference fragment types to receive a 100% partial score during any simulation (Table 2.5).

                 AlkEthOH            PhEthOH             MiniDrugBank
Fragment Type    Initial   Maximum   Initial   Maximum   Initial   Maximum
Bond             91.5      100.0     78.9      100.0     59.1      96.2
Angle            80.3      100.0     65.2      100.0     48.8      93.6
Torsion          32.0      100.0     34.7      100.0     29.4      76.0
Nonbonded        30.4      100.0     24.2      99.7      18.1      99.5

Table 2.4: Initial and maximum total scores achieved across all SMIRKY simulations with AlkEthOH, PhEthOH, and MiniDrugBank. Initial scores are the score from the initial types for each fragment type in each molecule set. Maximum scores are the highest total score measured during a single iteration in any simulation performed.

Fragment Type    AlkEthOH   PhEthOH   MiniDrugBank
Bond             5/5        10/10     68/73
Angle            4/4        6/6       32/34
Torsion          11/11      16/16     117/136
Nonbonded        8/8        9/9       25/26

Table 2.5: Number of fragment types with 100% partial score in all SMIRKY simulations. This table summarizes results for all SMIRKY simulations. For each molecule set, we report the fraction of reference fragment types that get a 100% partial score during at least one simulation. It shows that we were successful in identifying all fragment types in the AlkEthOH and PhEthOH sets where all fractions are equal to one. As with SMARTY, we were still partially successful for MiniDrugBank given that we find the majority of fragment types.

SMIRKY is able to recover all smirnoff99Frosst types in AlkEthOH

As with SMARTY, AlkEthOH provides a useful toy example for testing our methods, and SMIRKY is successful based on total scores for all fragment types. In smirnoff99Frosst, there are nonbonded parameters corresponding to the same hydrogen parm99/parm@Frosst atom types discussed above (Section 2.4.1). To distinguish all of these types, multiple beta position atoms must be identified. Thus, AlkEthOH is a good test set, although there are only a few fragment types in each category. SMIRKY discovers fragment type SMIRKS for all five bond, four angle, 11 proper torsion, and eight nonbonded fragment types required to type AlkEthOH with smirnoff99Frosst (Table 2.5). SMIRKY is able to generate sets of SMIRKS patterns which achieve a total score of 100% for all smirnoff99Frosst fragment types in AlkEthOH (Table 2.4). An average total score of > 90% at multiple temperatures provides further evidence of SMIRKY’s success (Table 2.6).

While SMIRKY is quite successful in general, its success is not universal at all temperatures. At moderate temperatures, SMIRKY is able to discover SMIRKS patterns complex enough to agree with the fragment typing used in smirnoff99Frosst, achieving a successful total score of 100%. But, as is expected with an MC algorithm, we achieve relatively poor scores both at T = 0 and at the high temperature of 0.1. However, at the intermediate temperature of

[Figure 2.14 plot: total score (y axis) versus iterations (x axis, 0 to 10,000), with one curve each for temperatures 0.001, 0, and 0.1.]

Figure 2.14: Total score versus iteration for SMIRKY sampling torsion fragment types in AlkEthOH. At a temperature of zero (pink line), SMIRKY works solely as a stochastic optimizer and gets stuck in a local optimum. When temperatures are too high, such as 0.1 (gray line), the probability of accepting moves that cause a decrease in score is very high, leading to dramatic changes in score throughout the simulation. At more moderate temperatures, such as 0.001 (blue line), a total score of 100% is eventually achieved.

AlkEthOH Average Score (%)*
Temperature   Bond            Angle         Torsion       Nonbonded
1             86.7 ± 0.3      67.4 ± 0.3    57.4 ± 0.4    58.6 ± 0.2
0.1           91.8 ± 0.2      77.2 ± 0.4    68.8 ± 0.6    66.1 ± 0.2
0.01          99.66 ± 0.07    94.0 ± 0.4    90.9 ± 0.8    89 ± 1
0.001         99.95 ± 0.01    97.4 ± 0.7    97.7 ± 0.4    91 ± 2
0.0001        99.92 ± 0.03    97.9 ± 0.7    98.3 ± 0.4    93 ± 1
1e-05         99.93 ± 0.03    97.9 ± 0.7    97.3 ± 0.9    89 ± 2
1e-06         99.94 ± 0.02    97.9 ± 0.6    98.3 ± 0.4    87 ± 2
0             99.94 ± 0.02    94.6 ± 0.9    93 ± 1        76 ± 1

Table 2.6: Average of average total scores of 10 SMIRKY simulations with AlkEthOH. We took the average score at each temperature for each of our 10 simulations (10,000 iterations each), then averaged this across all 10 simulations. * - uncertainties were estimated by the standard error over all 10 trials.

0.001, SMIRKY eventually generates torsion SMIRKS with a 100% score and maintains that high score (Figure 2.14). In addition to the examples depicted here, multiple simulations at the temperatures of $10^{-6}$, $10^{-5}$, and 0.0001 also reached a 100% total score. The fact that multiple temperatures reach a total score of 100% indicates we have been successful in automatically sampling the chemical perception trees for torsion parameters in AlkEthOH. Details for all simulations are included in the Additional Information.
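The temperature dependence described here is what one would expect from a Metropolis-style acceptance test. The sketch below assumes a standard Metropolis criterion applied to a score being maximized, with scores expressed as fractions between 0 and 1; the exact criterion used by SMIRKY is defined in the Methods and may differ in detail. The function names are illustrative.

```python
import math
import random

def acceptance_probability(old_score, new_score, temperature):
    """Probability of accepting a move under an assumed Metropolis criterion."""
    if new_score >= old_score:
        return 1.0  # improvements (and ties, by assumption) are always accepted
    if temperature == 0:
        return 0.0  # T = 0: the sampler can never step downhill
    return math.exp((new_score - old_score) / temperature)

def accept_move(old_score, new_score, temperature, rng=random):
    """Draw a random number and decide whether to keep the proposed move."""
    return rng.random() < acceptance_probability(old_score, new_score, temperature)

# Chance of accepting a move that lowers the total score by 0.01 (one percentage point).
for temperature in (0, 0.001, 0.1):
    p = acceptance_probability(0.90, 0.89, temperature)
    print(f"T = {temperature}: p(accept) = {p:.2e}")
```

Under this assumption, a one-point drop in score is accepted about 90% of the time at T = 0.1, almost never at T = 0.001, and never at T = 0, consistent with the noisy, converging, and trapped trajectories in Figure 2.14.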

PhEthOH demonstrates the complexity of SMIRNOFF parameters

PhEthOH adds a small amount of complexity relative to AlkEthOH, including aromatic rings in addition to alkanes, ethers, and alcohols. It requires a total of 10 bond, six angle, 16 torsion, and nine nonbonded smirnoff99Frosst parameters. As with AlkEthOH, we performed 10 SMIRKY simulations at each temperature for each fragment type. This small increase in complexity also increased the difficulty of matching all reference types simultaneously, as shown by the drop in average total score for all fragment types (Table 2.7). For bonds, angles, and nonbonded fragment types, SMIRKY is able to achieve a 100% total score; the same could not be said for torsions (Table 2.4). Nevertheless, SMIRKY is able to find a 100%


Figure 2.15: Frequency of success for SMIRKY simulations with AlkEthOH and PhEthOH. For torsion fragments in both AlkEthOH (a) and PhEthOH (b), we plot the fraction of simulation time where each reference fragment type has a torsion SMIRKS pattern that receives a partial score of 100%. For AlkEthOH, at temperatures below 0.001, all of the reference fragments spend at least 50% of the simulation time with a 100% partial score. However, for PhEthOH, many of the reference fragments spend a significant amount of simulation time without a 100% partial score.

partial score for all torsions at most temperatures (Table 2.5).

A 100% partial score is evidence that SMIRKY can generate SMIRKS of the same chemical complexity found in smirnoff99Frosst. As with SMARTY, we examine the partial score for each reference torsion type and consider the fraction of iterations spent at 100% (Figure 2.15). For all but one reference torsion type (t2), SMIRKY is able to achieve a partial score of 100% for at least a fraction of time at all temperatures. It does achieve a 100% partial score for t2 at most temperatures, only failing at temperatures of 0 and 0.1. However, PhEthOH is slightly more challenging, as evidenced by the fact that heat map colors are lighter in general.

Unlike in AlkEthOH, SMIRKY was not able to achieve a total score of 100% when sampling torsions in PhEthOH (Table 2.4). PhEthOH only has five more torsion types (16 versus 11), yet these make it substantially more challenging to find and keep sufficiently complex SMIRKS. In Section 2.4.1, we estimated the number of moves required to generate certain SMARTS patterns.

PhEthOH Average Score (%)*
Temperature   Bond            Angle         Torsion       Nonbonded
1             80.0 ± 0.4      61.3 ± 0.3    49.8 ± 0.5    46.8 ± 0.2
0.1           85.0 ± 0.3      69.5 ± 0.5    55.5 ± 0.5    54.4 ± 0.2
0.01          97.5 ± 0.1      88.5 ± 0.9    83.3 ± 0.7    83 ± 2
0.001         99.73 ± 0.06    94.8 ± 0.4    91.1 ± 0.9    91 ± 2
0.0001        99.78 ± 0.04    95.5 ± 0.5    93.0 ± 0.5    90 ± 3
1e-05         99.83 ± 0.03    95.8 ± 0.4    91.3 ± 0.8    85 ± 3
1e-06         99.83 ± 0.04    96.8 ± 0.6    92 ± 1        90 ± 3
0             99.7 ± 0.1      92.8 ± 0.8    87 ± 2        71 ± 3

Table 2.7: Average of average total scores of 10 SMIRKY simulations with PhEthOH. We took the average score at each temperature for each of our 10 simulations (10,000 iterations each), then averaged this across all 10 simulations. * - uncertainties were estimated by the standard error over all 10 trials.

When considering fragment types with more atoms, the combinatorial problem grows further since there are multiple atoms and bonds that require decorators. The lower frequency of 100% partial scores for torsions in PhEthOH is our first example of the implications of this combinatorial problem. For PhEthOH, we are able to generate SMIRKS for all reference torsions during at least some fraction of most simulations. However, if we are going to effectively sample chemical perception trees for the development of force fields, future move proposals will need to move through chemical space more efficiently. We will examine this problem more closely when analyzing SMIRKY’s results on MiniDrugBank (Section 2.4.2).

Future move proposal engines could be made more efficient by taking advantage of basic chemical information. For example, if a double bond was already specified between a carbon and another atom, that carbon atom cannot have four bonds; however, as currently implemented, SMIRKY will occasionally propose moves which attempt to add an “X4” decorator to that carbon. We could take this concept one step further by learning which decorators can productively be applied to which atoms. For example, hydrogen can only have one bond and it is always single, so adding decorators to a hydrogen atom or its connecting bond will never usefully impact the separation of fragment types. Currently, SMIRKY makes many naïve moves, wasting many of its iterations generating SMIRKS patterns that cannot help

separate fragments for these and other reasons.
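A filter of the kind suggested above might look like the following sketch. It is not part of SMIRKY: the atom representation (a dictionary holding an atomic number and the orders of the bonds already specified in the pattern), the function name, and the restriction to carbon and hydrogen are all assumptions made for illustration.

```python
CARBON_VALENCE = 4

def is_sensible(atom, decorator):
    """Reject decorator proposals that contradict what is already specified.

    `atom` is a hypothetical record: {"element": atomic_number,
    "bond_orders": [orders of bonds already present in the pattern]}.
    """
    if atom["element"] == 1:
        # Hydrogen always has exactly one single bond, so decorators add nothing.
        return False
    if atom["element"] == 6 and decorator.startswith("X"):
        requested = int(decorator[1:])
        n_bonds = len(atom["bond_orders"])
        # Each remaining unit of valence can supply at most one more neighbor.
        max_connections = n_bonds + (CARBON_VALENCE - sum(atom["bond_orders"]))
        return n_bonds <= requested <= max_connections
    return True  # no rule implemented for this case, so allow the proposal

carbon_with_double_bond = {"element": 6, "bond_orders": [2]}
hydrogen = {"element": 1, "bond_orders": [1]}

print(is_sensible(carbon_with_double_bond, "X4"))  # False
print(is_sensible(carbon_with_double_bond, "X3"))  # True
print(is_sensible(hydrogen, "X1"))                 # False
```

A move proposal engine could call such a check before constructing a child pattern, so that iterations are not spent on SMIRKS that cannot match any molecule.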

MiniDrugBank is a significantly more challenging molecule set

MiniDrugBank covers significantly more chemical space than either of the other molecule sets with 73 bond, 34 angle, 136 proper torsion, and 26 nonbonded parameters from smirnoff99Frosst. It uses all parameters employed in molecules from the DrugBank database that can be atom typed by the parm99/parm@Frosst force field, as described in Section 2.3.4. This is a more complex set, so despite considerable success, we are often unable to obtain a total score of 100%. Specifically, we were unable to find a total score of 100% for bonds, angles, torsions or nonbonded fragment types with 10,000 iterations. We also saw a significant decrease in the average total score for all fragment types during our ten simulations (Table 2.8) relative to other molecule sets. For this reason we repeated the simulations with 50,000 and then 100,000 iterations. SMIRKY still did not find a total score of 100% during the longer simulations. However, overall, SMIRKY was partially successful for MiniDrugBank. We achieve total scores over 90% for nonbonded, bond, and angle types for the highest scoring iterations (Table 2.4). The highest total score for torsion types is 76%, which is still promising considering the large combinatorial problem.

On combining the results from every iteration in all simulations (10,000, 50,000, and 100,000 iterations at all temperatures) we find a 100% partial score for the majority of smirnoff99Frosst reference fragment types (Table 2.5). Heat maps for MiniDrugBank, like the examples shown for torsions in AlkEthOH and PhEthOH (Figure 2.15), are available in the Supporting Information. There are five bond, two angle, 19 torsion, and one nonbonded reference fragment types where SMIRKY never finds a 100% partial score. Potential reasons for missing these fragment types are summarized in this section and discussed in further detail in the Supporting Information. While it would be ideal to find all the fragment types, SMIRKY is able to find SMIRKS patterns that agree with 91% of the smirnoff99Frosst fragment types,

indicating that we do recover a great deal of the target chemistry.

MiniDrugBank Average Score (%)*
Temperature   Bond          Angle         Torsion       Nonbonded
1.0           67.0 ± 0.4    50.3 ± 0.3    32.7 ± 0.4    35.9 ± 0.3
0.1           70.3 ± 0.7    53.5 ± 0.5    34.9 ± 0.8    44.0 ± 0.4
0.01          82.9 ± 0.4    70 ± 1        47 ± 1        77 ± 1
0.001         90.1 ± 0.6    76.4 ± 0.9    55 ± 1        87 ± 2
0.0001        91.6 ± 0.4    81.4 ± 0.7    56 ± 1        86 ± 2
1e-05         91.9 ± 0.5    81 ± 1        54 ± 2        86 ± 1
1e-06         90.7 ± 0.6    77 ± 1        60 ± 1        86 ± 1
0             90 ± 1        81.7 ± 0.9    60 ± 2        79 ± 2

Table 2.8: Average of average total scores of 10 SMIRKY simulations with MiniDrugBank. We took the average score at each temperature for each of our 10 simulations (10,000 iterations each), then averaged this across all 10 simulations. * - uncertainties were estimated by the standard error over all 10 trials.
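The "average of averages" reported in these tables can be computed as in the sketch below. It assumes the standard error is the sample standard deviation of the per-simulation means divided by the square root of the number of simulations; the score traces are synthetic placeholders generated only to make the example runnable, not data from this work.

```python
import random
import statistics

def average_of_averages(score_traces):
    """Mean of per-simulation mean scores, with its standard error."""
    per_simulation = [statistics.fmean(trace) for trace in score_traces]
    mean = statistics.fmean(per_simulation)
    std_error = statistics.stdev(per_simulation) / len(per_simulation) ** 0.5
    return mean, std_error

# Synthetic placeholder data: 10 "simulations" of 1,000 total scores each.
random.seed(0)
traces = [[random.uniform(80.0, 100.0) for _ in range(1000)] for _ in range(10)]

mean, std_error = average_of_averages(traces)
print(f"{mean:.1f} ± {std_error:.1f}")
```

Averaging within each simulation first means each trial contributes a single value, so the quoted uncertainty reflects variation between independent simulations rather than between iterations.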

As with SMARTY, there are a few fragment types that SMIRKY never recovers. In the Supporting Information we provide information about these missing fragment types, exploring in more detail why some are so challenging and why others are undiscoverable with SMIRKY’s current algorithm. In the following paragraphs we summarize the categories where SMIRKY’s algorithm could be improved in order to discover these missing types.

The order of the fragment type hierarchy and relationship between parent and child types could help explain some missing types. The two missing angle types, a6 (“[*:1]~;!@[*;r3:2]~;!@[*:3]”) and a12 (“[*:1]~;!@[*;r5:2]~;@[*;r5:3]”), highlight the importance of where newly generated SMIRKS are placed in the hierarchy. In smirnoff99Frosst these patterns always match carbon atoms in 3- and 5-membered rings, but atoms 1 and 3 can be any element, meaning it is easy to lose progress made toward these two patterns when patterns lower in the hierarchy also match the ring carbon. SMIRKY requires that when a child SMIRKS pattern is created the parent SMIRKS still matches at least some molecules. In most cases this prevents the creation of unnecessarily specific SMIRKS, but in some cases makes it impossible to OR decorators together. For example, we would be unable to find the exact pattern for n7 (“[#1:1]-[#6X4]~[*+1,*+2]”) because creation of a child with a second OR

type on the beta atom will always empty the parent. This is also true for many of the other 19 missing torsions. SMIRKY is not trying to discover the exact SMIRKS patterns used in smirnoff99Frosst so it is possible our algorithm could still succeed finding SMIRKS matching these reference types using a different combination of moves. However, allowing moves to combine SMIRKS decorators could make distinguishing this type of chemistry easier. Another possible solution is allowing a child to completely replace a parent when the generated SMIRKS matches a larger section of chemical space, but not when it types the same group of fragments the parent typed.

The combinatorial problem that occurs in SMARTY is exacerbated in SMIRKY since multiple atoms may require decorators or alpha or beta substituents. It is difficult to identify a specific reason SMIRKY does not reach a 100% partial score for some of the torsion types and the five missing bond types. This suggests that the combinatorial problem described above (Section 2.4.2) is the most likely source of failure. Thus it seems likely that chemical perception sampling tools in future force field parameterization will need improved move sets with higher acceptance rates.

To estimate the size of the search problem posed by torsions in SMIRKY, we considered an example reference torsion t78 (“[*:1]-,:[#6X3:2]=[#7X2:3]-[*:4]”), a deceptively simple example where SMIRKY never reaches a 100% partial score. SMIRKY simulations for torsions are initialized with the center two atoms specified and no bond information provided, so the initial torsion relevant to t78 is “[*:1]~[#6:2]~[#7:3]~[*:4]”, that is, a carbon atom connected by any bond to a nitrogen atom. For the purpose of this exercise we will assume we want to generate this exact same SMIRKS; while that is not actually the goal of SMIRKY, it is a helpful example to illustrate the scope of the combinatorial problem. In this case, the probability of adding any one particular atom or bond decorator is approximately 0.01. In t78, there are an additional two atom decorators (X3 on carbon and X2 on nitrogen) along with four extra bond decorators. For each SMIRKY step, one torsion type from the

working set is chosen (this is the parent type) and then the child fragment is created by making changes to this parent type (Figure 2.3). In order to generate this particular SMIRKS we also need to include the probability of choosing the correct parent type, which is about 0.02. Even if we assume the order of adding each decorator is unimportant, it will typically take SMIRKY over 1 billion steps to generate this exact SMIRKS pattern.

One way to simplify the combinatorial problem in SMIRKY or future packages for sampling chemical perception trees would be creating a more efficient move set. As discussed in Section 2.4.2, SMIRKY currently considers all possible decorators for the atoms and bonds, whether or not they make chemical sense, and thus wastes considerable time proposing patterns that do not match any molecules. Considering the low probability of picking a given decorator, the fact that SMIRKY is able to find SMIRKS patterns which match 117 (86%) of the 136 reference torsions (Table 2.5) is further evidence that we have been fairly successful in sampling the chemical perception tree effectively. However, if we could increase the probability of picking good decorators, we could decrease the number of moves required to sample the requisite chemistry. Future progress of these tools should leverage moves which only employ chemically reasonable combinations of decorators. For example, a double bond decorator should never be allowed next to a tetrahedral carbon which only has single bonds. This information would not necessarily need to be provided by a human expert; instead, molecule sets used for training could be used to extract this information.

Overall, SMIRKY’s ability to find over 90% of the smirnoff99Frosst fragment types in MiniDrugBank shows it can automatically sample suitable chemical perception trees. Despite the shortcomings discussed above, SMIRKY is clearly capable of generating complex chemical perception like that historically generated by hand.

2.5 Conclusions

SMARTY and SMIRKY have proven capable of automatically sampling chemical perception trees relevant to general small molecule force fields. Both of our tools were tested first with relatively simple sets (AlkEthOH and PhEthOH) and then using MiniDrugBank, which provided a test set of molecules encompassing the diversity of the DrugBank database. SMARTY generated SMARTS patterns which resulted in a 93% total score for MiniDrugBank, and 100% for AlkEthOH and PhEthOH. SMIRKY achieved a total score of over 90% for nonbonded, bond, and angle fragment types in MiniDrugBank.

In some cases, the significant chemical complexity encoded in certain existing atom types limits our ability to find these types via individual SMARTS patterns. This may in part be because of overly complex atom typing, which, in traditional force fields, needs to encode the chemistry required for all force field parameter types simultaneously. Switching to direct chemical perception may allow for much more straightforward parameter assignment [82].

Our SMIRKY tool for working with direct chemical perception also had considerable success. SMIRKY generated SMIRKS patterns which capture more than 90% of the fragment types in smirnoff99Frosst for MiniDrugBank and 100% for AlkEthOH and PhEthOH. Coverage was less than perfect in part because there is a significant combinatorial problem in sampling chemical perception trees. This combinatorial problem is most pronounced in SMIRKY’s results on torsions, where many atoms and bonds may require multiple decorators. The ability to automatically recover the chemical perception in these existing force fields is a promising first step toward sampling over chemical perception as part of force field parameterization.

SMARTY and SMIRKY are prototypes created as a part of the Open Force Field Initiative [52], which aims to create open source parameterization tools that can provide accurate force fields without human experts being required in the parameterization process. The overall initiative is expected to have a significant impact on the quality of biophysical modeling and molecular design for a variety of fields, and facilitate force field science.

Historically, many force fields used atom types, a form of indirect chemical perception, where all atom types were generated by hand by an expert. As discussed in the Introduction, the complexity of these atom types has been a major contributor to the difficulty of force field development and a barrier to cross-comparing force fields. The new SMIRNOFF format, with direct chemical perception, is an important step forward, but as currently implemented still uses hand-encoded SMIRKS patterns to assign parameters. In order to automate force field development, the generation of these SMIRKS patterns must be done automatically. SMARTY and SMIRKY are able to generate SMARTS and SMIRKS with comparable chemical complexity to parm99/parm@Frosst and smirnoff99Frosst. These tools are a promising first step to determining chemical perception for general force fields without relying on human expertise.

While SMARTY and SMIRKY both show considerable promise, they both face challenges with the size of the combinatorial search problem, suggesting room for further improvement. The combinatorial problem becomes especially important for fragment types involving the most potential atoms and decorators, namely angles and torsions. One step to improving performance could include a better optimization algorithm, such as simulated annealing. Here, however, the sampling problem is exacerbated by the fact that move proposals are not necessarily chemically sensible, so future work will need to make more reasonable chemical moves to improve efficiency. Before introducing a new optimization algorithm, the move set needs to be improved. Particularly, instead of using any SMARTS decorator, these tools should take advantage of basic chemistry. For example, no molecule will have a tetrahedral carbon with a double bond, so such a move should never be proposed.

Our ultimate goal is to move past comparisons to existing force fields and to develop new force fields which hopefully provide improved accuracy. While SMARTY and SMIRKY are both promising examples of tools which sample over chemical perception trees, our work here used them in the context of existing force fields. Longer-term, they will instead be used to help sample over chemical perception as well as force field parameters, since our real goal is to improve force fields. New scoring functions will be used to evaluate SMIRKS patterns and parameters in the SMIRNOFF format compared to experimental and/or quantum chemical reference data. These tools will allow the development of next generation force fields in a completely automated way.

Chapter 3

Challenging molecular test cases for implicit solvation

3.1 Abstract

Implicit solvent hydration free energy methods play an important role in molecular modeling of biomolecular structure and dynamics. However, compared with explicit solvent hydration methods, their lack of accuracy is still a concern. In order to evaluate and potentially improve the accuracy of hydration free energy calculations on the FreeSolv database, we used a hybrid method consisting of GBOBC calculations and a reprocessing method using Semi-Explicit Assembly (SEA). During the calculations, we found challenging molecules that were studied more carefully in order to understand their behavior. This work showed that the effective Born radii calculated in the GBOBC method are a potential cause of inaccuracy for some specific molecules, and this needs to be taken into account when using such methodologies in molecular modeling studies. Overall, this work provides a set of molecules which will be important tests for the development of future GB models.

3.2 Introduction

Pharmaceutical drug discovery is an extremely expensive and time-consuming process [7, 112, 113, 114]. A substantial fraction of this effort is at the early stages of the discovery process when researchers search for molecules which bind to a biomolecular target with sufficient affinity and suitable properties to become a pharmaceutical drug [7, 8, 9, 10].

Hydration free energies are key for many important physical and biological processes, including binding, absorption, protein-protein interactions, protein folding, and membrane formation [115, 9, 116, 14, 18]. They are therefore also quite important for early stage drug discovery, since any kind of molecular simulation done in water requires a reasonably accurate description of the hydration free energy [117, 118, 119, 120, 121]. For example, in a protein-ligand binding calculation, the distribution of states sampled during the molecular dynamics simulation is ultimately related to the free energies, so if the solvent model is off, states will potentially no longer be sampled in the correct proportions, leading to inaccuracy in the binding affinity [118, 20, 122, 123, 124].

One potential way to dramatically accelerate binding or large system calculations is to modify the treatment of water in the molecular simulations employed [125, 126, 127, 128, 129, 130]. Rather than simulating water molecules explicitly (which can require the bulk of the computer time), so-called “implicit” solvent models can treat water much more quickly, increasing simulation efficiency by a factor of 10 or more in many cases [131, 132, 133, 125, 134, 135, 136, 137]. However, most implicit water models available so far come at a substantial cost in accuracy [138, 139, 140, 141, 142, 133, 126, 143].

In most implicit solvent models, the solvent is treated as a structureless continuum with certain dielectric and interfacial properties [144, 131, 145, 146, 147, 148, 127, 149, 150]. These methods provide the electrostatic forces of atoms and can be used in molecular dynamics (MD) simulations. One widely used model is the generalized Born (GB) model [151, 152], which makes analytical approximations of the Poisson equation (PE) theory for the electrostatic term (Gel) of the solvation free energy estimated by equation 3.1 [131, 145, 153, 148, 154, 127, 134]. Each atom in the molecule is treated as a sphere with a radius and a partial charge, and it has a dielectric constant equal to 1. The molecule is then surrounded by a solvent with a high dielectric constant of 80 (for water at 300 K). This methodology is relatively simple and computationally efficient [153, 155, 127].

Gsolv = Gel + Gnp (3.1)

The GB model was established to approach the results of the Poisson-Boltzmann (PB) model with lower computational cost and uses the so-called effective Born radii [151, 145, 153]. The effective Born radius is the radius that, when inserted in the Born formula, would give the same electrostatic solvation free energy as Poisson equation theory would provide [145, 127]. In other words, with “perfect” atomic radii, the GB model should be sufficiently accurate to reproduce the PB model. The effective Born radii need to be recomputed every time the conformation changes, and good agreement between GB and PB can only happen if the effective Born radii from GB match those computed using the PB method [145, 127].

In order to address the lack of accuracy of implicit solvent models relative to explicit solvent, researchers have explored hybrid methods [156, 157, 158, 159, 154, 160]. Hybrid explicit/implicit solvent models proved to be faster than explicit solvent methods, but the accuracy improved only modestly [156, 140, 157, 161]. Hybrid methods using molecular mechanics and quantum mechanics in implicit solvent improved the accuracy, but the precision was still inadequate [158]. Recently, similar to hybrid solvation models, a solvation method called Semi-Explicit Assembly (SEA) was developed [162]. The SEA approach works by precomputing properties of water solvation shells around sample spheres, then assembling a solute’s solvation shell [162]. This model improves the capture of various physical aspects of solvation, and is almost as accurate as explicit solvent simulations [162].

In this paper, we evaluate the accuracy of a different kind of hybrid method using implicit solvent and semi-explicit assembly methods. In particular, we analyze challenging molecular cases for the implicit solvent model, probe the nature of the implicit solvent model’s deficiencies, and assess whether our hybrid approach can remedy these deficiencies. We hope this study can provide assistance for those who use implicit solvent, especially in the context of free energy calculations, and help the development of future implicit solvent methods.

3.3 Methods

3.3.1 FreeSolv Database

FreeSolv is a hydration free energy database for neutral compounds and can be used to test new solvation free energy methods [1]. This database contains experimental and calculated hydration free energy values, SMILES strings, PubChem compound IDs, IUPAC names, and calculated enthalpies and entropies of hydration for 642 small organic molecules. Calculated hydration free energies in the database are calculated using explicit water, so FreeSolv is a good test set for implicit solvent methods, because we can easily compare our calculated values with explicit solvent results.

For this study, we used the structure files provided on GitHub (https://github.com/MobleyLab/FreeSolv). Because our simulations were done in GROMACS, we used the GROMACS format topology and coordinate files from the database for all 642 molecules. For our further investigation of the 1,4-dimethylpiperazine case, we re-generated the AMBER topology and coordinate files for this particular molecule and 1-methylpyrrolidine using mmtools [163] and OpenEye Toolkits [91], and AM1-BCC charges and GAFF atom types were assigned. For 1-methylpyrrolidine, we also used amber to from openmoltools to convert the topology and coordinate files to a GROMACS-supported format.

3.3.2 Implicit Hydration Free Energy Calculation using GROMACS

The implicit solvent free energy calculations were performed on the entire FreeSolv database using the GROMACS v.4.6.7 [164] simulation package with the GBOBC solvation model [153]. First, energy minimization and equilibration of all molecules were performed. The equilibration was run for 100 ps in the NVT ensemble. The production run was then performed and we stored 5000 full precision trajectory snapshots (5 ns); this number was important for the next step of this project. These steps were performed in water (GB) and in the gas phase (vacuum) for each ligand. Full details of our protocol are given in the Supporting Information in the form of run input files (mdp format). We reprocessed each of the topology and coordinate files generated from the water and vacuum simulations in water and in vacuum to obtain the potential energy differences. The GBHCT model [165] was also used to analyze the particular case of 1,4-dimethylpiperazine.

We reprocessed the states (water and vacuum) for each molecule: the trajectories from the water simulation were re-evaluated with the vacuum and water input files, separately, and the same was done for the trajectories from the vacuum simulation. The Bennett Acceptance Ratio (BAR) [166] method was used in order to get the minimum-uncertainty estimate of the free energy difference between the two thermodynamic states (GB and vacuum). This method, in contrast to explicit solvent hydration free energy calculations, can be applied without intermediate lambda values [167]. The potential energies of the simulation snapshots of each state were evaluated, and these were then used as input to calculate the free energy difference between the GB and vacuum states.
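To make the two-state BAR estimate concrete, the following is a minimal sketch of Bennett's self-consistent equation solved by bisection. It is a generic illustration and not the GROMACS implementation used in this work; the function names, the kcal/mol unit convention, and the 300 K default are assumptions. Here `w_forward` holds the energy differences U(GB) − U(vacuum) evaluated on vacuum snapshots, and `w_reverse` holds U(vacuum) − U(GB) evaluated on GB snapshots.

```python
import math

KB = 0.0019872041  # Boltzmann constant in kcal/(mol K)


def fermi(x):
    """Fermi function 1 / (1 + exp(x)), arranged to avoid overflow."""
    if x > 0:
        e = math.exp(-x)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(x))


def bar_free_energy(w_forward, w_reverse, temperature=300.0, tol=1e-9):
    """Free energy difference F(state 1) - F(state 0) from Bennett's equation.

    w_forward: U1(x) - U0(x) for snapshots x sampled in state 0
    w_reverse: U0(x) - U1(x) for snapshots x sampled in state 1
    """
    beta = 1.0 / (KB * temperature)
    m = math.log(len(w_forward) / len(w_reverse))

    def imbalance(delta_f):
        lhs = sum(fermi(m + beta * (w - delta_f)) for w in w_forward)
        rhs = sum(fermi(-m + beta * (w + delta_f)) for w in w_reverse)
        return lhs - rhs  # increases monotonically with delta_f

    # Bracket the root, widening the interval if necessary
    candidates = list(w_forward) + [-w for w in w_reverse]
    lo, hi = min(candidates) - 1.0, max(candidates) + 1.0
    while imbalance(lo) > 0:
        lo -= 10.0
    while imbalance(hi) < 0:
        hi += 10.0

    # Bisection on the monotonic imbalance function
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if imbalance(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Because both sets of snapshots are evaluated in both states, the two end states alone provide the forward and reverse energy differences that the estimator needs.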

3.3.3 We separate the molecules with positive electrostatic term

The electrostatic term (Gel) captures all the electrostatic effects of the solvent. Gel is the interaction between the charge distribution and its reaction potential, the potential induced by the charge distribution in the presence of the dielectric boundary at the solute-solvent interface (Equation 3.2). The GB model approximates the Poisson equation (PE), calculating the electrostatic term by:

$$G_{el} = -\frac{1}{2}\left(1 - \frac{1}{\epsilon}\right)\sum_{ij}\frac{q_i q_j}{f^{GB}(r_{ij}, R_i, R_j)} \qquad (3.2)$$

where $\epsilon$ is the dielectric constant of the solvent, $q_i$ is the atomic charge, $r_i$ is the position of atom $i$ (so that $r_{ij}$ is the distance between atoms $i$ and $j$), $R_i$ is the effective Born radius of atom $i$, and

$$f^{GB} = \left[r_{ij}^{2} + R_i R_j \exp\left(-\frac{r_{ij}^{2}}{4 R_i R_j}\right)\right]^{1/2} \qquad (3.3)$$
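As an illustration of how the pairwise GB sum is evaluated, here is a minimal sketch of the electrostatic term in its standard form, with the conventional −1/2 prefactor and the sum running over all pairs including i = j. It assumes the effective Born radii are already known; the function names, the use of Å and kcal/mol, and the Coulomb conversion constant are choices made for this example rather than details of the GROMACS implementation.

```python
import math

COULOMB = 332.06371  # kcal * Angstrom / (mol * e^2), an assumed unit convention


def f_gb(r_ij, radius_i, radius_j):
    """Smoothing function f^GB for a pair of atoms (distances in Angstrom)."""
    product = radius_i * radius_j
    return math.sqrt(r_ij ** 2 + product * math.exp(-r_ij ** 2 / (4.0 * product)))


def gb_electrostatic(charges, coords, born_radii, eps_solvent=80.0):
    """GB electrostatic solvation term in kcal/mol.

    charges: partial charges in units of e
    coords: (x, y, z) tuples in Angstrom
    born_radii: effective Born radii in Angstrom (taken as given here)
    """
    prefactor = -0.5 * COULOMB * (1.0 - 1.0 / eps_solvent)
    total = 0.0
    for i in range(len(charges)):
        for j in range(len(charges)):  # includes i == j self terms
            r_ij = math.dist(coords[i], coords[j])
            total += charges[i] * charges[j] / f_gb(r_ij, born_radii[i], born_radii[j])
    return prefactor * total


# Single ion: reduces to the Born formula
print(gb_electrostatic([1.0], [(0.0, 0.0, 0.0)], [2.0]))
```

For a single charged sphere the double sum collapses to the Born expression, which is a convenient sanity check on any implementation.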

The electrostatic term should always be negative. This is because Gel relates to the interactions between the solute and a set of image charges induced by polarization in the surrounding dielectric. Such interactions must always be attractive, resulting in a favorable (negative) electrostatic component of the hydration free energy. So, in order to analyze the solvent effect of the molecules, we calculated the electrostatic component of the hydration free energy of the molecules using g_energy from the GROMACS v.4.6.7 simulation package.

The molecules that presented positive Gel values were further investigated.

Figure 3.1: Thermodynamic path for calculating the ∆GSEA for our hybrid method. Svac and SGB are the solvation energies using GROMACS in vacuum and GB, respectively. SSEA is the solvation energy from SEA using the 5000 trajectories from the implicit solvent hydration free energy using GROMACS. ∆GSEA was calculated using ∆∆GGB→SEA and ∆Ghyd (implicit solvent).

3.3.4 Reprocessing of Implicit Solvent Hydration Free Energy using SEA - Hybrid model

The next step was reprocessing the implicit solvent hydration free energy data from the previous simulation using SEA. The calculation was performed using the 5,000 trajectory snapshots in water from the previous simulation of each molecule. We used the default configurations in the SEA solvate program [162]. The Zwanzig relationship, or exponential averaging (EXP), was used to compute the free energy change from the GBOBC implicit solvent model to the SEA solvent (∆∆GGB→SEA). Then, ∆∆GGB→SEA was applied as a correction to the computed hydration free energies to obtain values for ∆GSEA. Figure 3.1 illustrates how ∆GSEA was calculated.

EXP is a two-state method, and although estimating free energies with it is straightforward, it is not very efficient and can only be applied in some specific situations. In particular, estimators such as BAR require simulations in both states of interest. Here, SEA is not suitable for dynamics, so we were unable to run a simulation of the SEA solvated state. However, we can compute the energy difference using EXP by Equation 3.4. Usually EXP does not work well because of the difficulty of getting enough data for exponential averaging to yield reliable results. In our case, we were calculating ∆∆GGB→SEA by reprocessing the GB solvation energies from GROMACS. In other words, SEA calculated the energies from the same configurations sampled in our implicit solvent method using GROMACS, producing a good phase space overlap between the two states. EXP was used in a previous study where GB simulations were reprocessed using two GB models in Amber, showing a very good phase space overlap [168]. Here, error analysis also confirms our data is adequate to obtain accurate free energy estimates.

$$\Delta F = -\beta^{-1} \ln \left\langle e^{-\beta \Delta E_{0 \to 1}} \right\rangle_{0} \qquad (3.4)$$
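The exponential average in Equation 3.4 can be evaluated directly from the per-snapshot energy differences. The sketch below is a minimal illustration, assuming energies in kcal/mol and a temperature of 300 K; the function name `exp_free_energy` is hypothetical, and the log-sum-exp rearrangement is a numerical-stability choice rather than part of the method described here.

```python
import math

KB = 0.0019872041  # Boltzmann constant in kcal/(mol K)


def exp_free_energy(delta_e, temperature=300.0):
    """Zwanzig (EXP) estimate of the free energy change from state 0 to state 1.

    delta_e: energy differences E1(x) - E0(x), in kcal/mol, evaluated on
             snapshots x sampled in state 0 (here, the GB simulation).
    """
    beta = 1.0 / (KB * temperature)
    exponents = [-beta * de for de in delta_e]
    # log-mean-exp, shifted by the largest exponent for numerical stability
    shift = max(exponents)
    mean_exp = sum(math.exp(x - shift) for x in exponents) / len(exponents)
    return -(shift + math.log(mean_exp)) / beta


# If every snapshot shows the same energy difference, EXP returns that value
print(exp_free_energy([2.0] * 10))
```

The average is dominated by the snapshots with the most negative energy differences, which is why EXP needs good phase space overlap between the two states to converge.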

We were interested in understanding when SEA reduced errors in calculated hydration free energies (compared to explicit solvent) and by how much. Therefore, we also compared the implicit solvent calculation by GROMACS and our hybrid implicit solvent/SEA method with the explicit solvent model used in a previous study as a standard [1]. Thus, we calculated the change in the error due to SEA using Equation 3.5.

∆Err = (| ∆Gimplicit − ∆Gexplicit |) − (| ∆GSEA − ∆Gexplicit |) (3.5)

Here, a large positive value means SEA reduces the error substantially, while a value near zero means little change and a negative value means SEA makes accuracy worse.
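As a worked example of Equation 3.5, the short sketch below evaluates the change in error for 1,4-dimethylpiperazine using the values reported in this chapter (implicit solvent 3.381, hybrid SEA -3.766, and explicit solvent -7.87 kcal/mol). The function name `delta_err` is hypothetical.

```python
def delta_err(dg_implicit, dg_sea, dg_explicit):
    """Change in absolute error relative to explicit solvent due to SEA (kcal/mol)."""
    return abs(dg_implicit - dg_explicit) - abs(dg_sea - dg_explicit)


# 1,4-dimethylpiperazine values reported in this chapter (kcal/mol)
dmp = delta_err(dg_implicit=3.381, dg_sea=-3.766, dg_explicit=-7.87)
print(f"{dmp:.3f}")  # about 7.147, a large positive value
```

For this molecule the error relative to explicit solvent drops by about 7.1 kcal/mol, so SEA reduces the error substantially.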

3.3.5 1,4-dimethylpiperazine study case

During our analysis of the GROMACS implicit solvent free energy simulations, seven molecules presented positive values of the electrostatic term. The molecule with the largest positive value, 1,4-dimethylpiperazine, was picked for deeper investigation. We applied different implicit solvent models in different software in order to understand the behavior of 1,4-dimethylpiperazine. Our intention was to investigate whether the problem with the odd hydration free energy and electrostatic component values of this specific molecule was coming from a bug in GROMACS or was related to the OBC model.

We Tested Amber Implicit Solvent Models for 1,4-dimethylpiperazine

We performed implicit solvent hydration free energy calculations for 1,4-dimethylpiperazine in Amber 15 [169, 170]. Topology and coordinate files were generated by tleap of Amber 15 [169] using mmtools (available at https://github.com/choderalab/mmtools). First, we used ‘sander’ of Amber 15 to run an MD simulation and produce trajectories. We performed 500 steps of minimization in implicit solvent using the default radii set up by LEaP for the GB model (igb=1). An MD simulation of 100 ps was then performed using the same GB model to obtain trajectory snapshots for analysis. The free energy was calculated using the MM-PBSA Python script [171] implemented in Amber12 with the HCT (igb=1), OBC (igb=2), OBC-2 (igb=5), and GBn (igb=7) models.

We used OpenMM to check its implicit solvent models for 1,4-dimethylpiperazine

We also used OpenMM [172] to test its implementation of several implicit solvent models on 1,4-dimethylpiperazine. We used openmoltools (available at https://github.com/choderalab/openmoltools) and parmed (available at https://parmed.github.io/ParmEd/html/index.html) to perform a simple implicit solvent evaluation with OpenMM using the topology and coordinate files from Amber. The Langevin integrator was set up with a temperature of 300 K. We tested the HCT (Hawkins-Cramer-Truhlar) GBSA model [173] (which corresponds to igb=1 in AMBER), the OBC1 (Onufriev-Bashford-Case) GBSA model [153] using the GB OBCI parameters (which corresponds to igb=2 in AMBER), and the OBC2 GBSA model [153] using the GB OBCII parameters (which corresponds to igb=5 in AMBER).

Testing a new GB flavor called GBNSR6 for 1,4-dimethylpiperazine and a series of molecule types

GBNSR6 [174] is a newer implicit solvent method that uses the Generalized Born (GB) model, where an integration called “R6” (over the Lee-Richards molecular surface) computes the effective Born radii numerically [174]. The ALPB (Analytical Linearized Poisson-Boltzmann) model is used to calculate the polar component of the solvation energy (∆Gel). GBNSR6 is part of Amber 15 [169], and we used the default values for the dielectric constant of the solute region, the ionic strength of the solution, the uniform offset to the (inverse) effective radii, and the surface tension parameter. The dielectric constant of the implicit solvent was 80.0. The grid and arc resolutions were set to 0.3 Å and 0.1 Å, respectively. The GB equation used was the ALPB correction. We also computed the nonpolar solvation energies. The method used here is described in a previous study [175], and Onufriev suggested this approach might solve issues we had encountered with positive nonpolar components.

3.4 Results and discussion

In this study, we also tested a hybrid method to calculate the hydration free energy of molecules, where we first ran implicit solvent calculations and then reprocessed them with a semi-explicit assembly method. We tested our methodology using the FreeSolv database and compared our results to explicit solvent simulations. During our analysis, we found some challenging cases for the implicit solvent hydration free energy method used, and these cases are discussed further (section 3.4.2).

Figure 3.2: Implicit (yellow dots) and hybrid solvent (lilac dots) versus explicit solvent [1] hydration free energy for the FreeSolv database molecules. (a) Linear regression (red line) analysis of implicit solvent model (GBOBC) values (on the horizontal axis) and explicit solvent model values (on the vertical axis). The molecules that presented a positive value of the electrostatic energy are shown as black triangles. The free energy distributions for the implicit and explicit solvent methods are shown on the top and the right side of the graph, respectively. The R² is equal to 0.8. (b) Linear regression (red line) analysis of hybrid SEA values (on the horizontal axis) and explicit solvent model values (on the vertical axis). The free energy distributions for the hybrid SEA and explicit solvent methods are shown on the top and the right side of the graph, respectively. The R² is equal to 0.95.

3.4.1 Our hybrid method improved the overall accuracy of the hydration free energies

First, we calculated the implicit solvent free energies using the GBOBC model in GROMACS for the 642 molecules from the FreeSolv database. We then compared the results of our implicit solvent method with explicit solvent hydration free energies from previous work [1]. The result is shown in Figure 3.2a.

We ran SEA using our stored snapshots from our implicit solvent calculation in water and then calculated the exponential average to compute the change in solvation free energy from GROMACS implicit solvent to SEA solvent (Figure 3.1). In Figure 3.2b, we can see how SEA hydration free energies performed compared to explicit solvent hydration free energies.

Comparing our results with previous explicit solvent work, the RMS error between GBOBC solvation and explicit solvent [1] was 2.00 kcal/mol, with a coefficient of determination (R²) of 0.8033 ± 0.0004. SEA had the best agreement with the explicit solvent method, with an RMS error of 1.08 kcal/mol and an R² of 0.9449 ± 0.0001. Regarding our computed implicit solvent model, the r was 0.629 ± 0.003 when compared with the computed explicit solvent model. Both GROMACS implicit solvent and SEA calculations yielded results with some correlation with explicit solvent values, but after reprocessing with SEA, we could see an improvement in the hydration free energy values.

In addition to comparing with explicit solvent values, a comparison with experiment is also warranted. The statistical analysis of the results yields a root-mean-square (RMS) error between calculated and experimental values of 2.56 kcal/mol and 1.81 kcal/mol for the GB solvent model and SEA, respectively. The SEA results were close to the RMS error found on the same set in previous work in explicit solvent [1], which was equal to 1.53 ± 0.07 kcal/mol.

The free energies for the implicit and hybrid solvent methods presented different distributions when compared to the explicit solvent models. Neither the implicit nor the hybrid solvent method presented a high peak as observed in the explicit solvent model (Figure 3.3). The implicit solvent method, when compared with the hybrid solvent method, presented a slight shift to the right with more positive free energies (Figure 3.3c). Thus, the hybrid solvent method presents a higher agreement with the distribution resulting from the explicit solvent calculations.


Figure 3.3: Hydration free energy distributions for explicit [1], implicit, and hybrid solvent hydration free energy calculations for the FreeSolv database. (a) Explicit (gray) and implicit (yellow) solvent hydration free energy distributions; (b) Explicit (gray) and hybrid SEA (lilac) solvent hydration free energy distributions; (c) Implicit (yellow) and hybrid SEA (lilac) solvent hydration free energy distributions.

We also calculated the electrostatic term of all 642 molecules using GROMACS, in order to analyze the solvent-solute electrostatic interaction. Seven molecules from the FreeSolv database showed positive Gel values, and their structures as well as their Gel are shown in Figure 3.4. These molecules may have some commonalities; for example most have a six-membered ring, several have nitrogen, and the others have quaternary carbons.

Among the molecules with positive electrostatic component of the hydration free energy, one molecule stood out from the rest — 1,4-dimethylpiperazine (DMP). DMP was the molecule that showed the highest positive Gel value, which was 2.190 ± 0.008 kcal/mol for the GB solvent model in GROMACS. Because those positive values were unexpected, we decided to further investigate why this behavior was shown for those molecules, and especially DMP. This discussion is presented in section 3.4.2.

3.4.2 A closer look at the 1,4-dimethylpiperazine implicit solvent model result

1,4-dimethylpiperazine showed a positive value of the electrostatic component of the hydration free energy and stood out from the other six molecules with positive Gel: the DMP value was the largest (2.190 ± 0.008 kcal/mol) among the seven molecules with positive Gel. The implicit solvent simulation in GROMACS using the GBOBC model returned a ∆G equal to 3.381 ± 0.003 kcal/mol for DMP. However, when we used our hybrid method, the hydration free energy of DMP was -3.766 ± 0.007 kcal/mol, which was significantly better and closer to the experimental and explicit solvent values (Table 3.2) of -7.6 and -7.87 ± 0.03 kcal/mol, respectively. In order to understand this odd result in the implicit solvent hydration free energy, we decided to analyze this molecule carefully.

Figure 3.4: Molecules from the FreeSolv database that showed positive GB polarization values. Molecules and Gpol values shown (kcal/mol): 3,3-dimethylpentane, 0.0007 ± 0.0002; (1r,4r)-1,4-dimethylcyclohexane, 0.0109 ± 0.0003; (1R,2S)-1,2-dimethylcyclohexane, 0.0291 ± 0.0003; 1,4-dimethylpiperazine, 2.190 ± 0.008; 2,2,5-trimethylhexane, 0.0341 ± 0.0003; 1-methylpiperidine, 0.768 ± 0.005; triethylamine, 1.631 ± 0.005.

The effective Born radius is crucial for the accuracy of the implicit solvent hydration free energies. Effective Born radii, also called “perfect” radii or just effective radii, are computed within the analytic GB approximation. In equation 3.2, Ri should be the radius that gives the same electrostatic energy that PE theory would give when inserted into the Born equation and used to compute the energy of a molecule in a dielectric, where only a single atom carries a single charge at its atomic center; if “perfect radii” are used, both the Born and Poisson equations should yield the same values. The effective Born radii are determined using an integral of the energy density of a Coulomb field over the molecular volume, which is evaluated either by numerical integration or by an analytical approximation.

Initially, we explored the idea that the Born radius might be incorrect for the nitrogen in DMP, especially given its relatively unique structure in FreeSolv. So, we changed the Born radius value for the nitrogen located in the topology file under the implicit solvent generalized Born parameter section. This value is related to the VdW radii of the atoms, which are then used to calculate the effective Born radii. We tested eight different radii values, ranging from 0.170 nm to 0.200 nm, and analyzed the Gel and the nonpolar solvation energies after performing a single rerun. The result of this single run with different Born radius values is in Figure 3.5.

We found that the Gel energy becomes positive as the size of the nitrogen decreases. Radii smaller than 0.176 nm have positive Gel energies. The standard value for the nitrogen radius used for DMP in the GBOBC implicit solvent calculation was 0.1625 nm, smaller than 0.176 nm.

We also investigated the effect of the size of the hydrogen radius on DMP. We varied the hydrogen radius from 0.125 nm to 0.200 nm. In Figure 3.5, the results suggest that positive values for the electrostatic component of the hydration free energy appear when the radius is smaller than 0.153 nm. We could see the same pattern with hydrogen that we observed with nitrogen. However, it was unclear to us whether one or both of these radii were really the problem or simply obscuring a larger problem with the GB model.

In view of concerns about the GB model, we were interested to determine whether the peculiar and unphysical value for this compound was a flaw in this GBOBC model in general, or in its specific implementation in GROMACS, so we also repeated our calculations using several implicit solvent models in AMBER.

The results corresponding to the electrostatic energies and hydration free energies using different IGB models and radii in AMBER are in Table 3.1. The hydration free energies were mostly far from the expected values. There was even one positive value, equal to 0.1 kcal/mol, for the electrostatic term in IGB=5 using mbondi radii (which is the model equivalent to the GBOBC model in GROMACS).

Figure 3.5: GB polarization energy for 1,4-dimethylpiperazine. We changed the radius size of the nitrogen (wine line) and hydrogen (tan line), separately, and calculated their GB polarization energies. Errors are shown in silver lines. Horizontal axis: VdW radius (nm); vertical axis: GB polarization energy (kcal/mol).

Table 3.1: Electrostatic (Gel) and hydration free energies (∆Gsolv) in kcal/mol for 1,4-dimethylpiperazine using AMBER in different solvation models and radii.

Radii     IGB=1                   IGB=2                  IGB=5                   IGB=7
          Gel            ∆Gsolv   Gel           ∆Gsolv   Gel            ∆Gsolv   Gel            ∆Gsolv
bondi     -6.5 ± 0.4     -4.7     -5.1 ± 0.5    -3.4     -2.28 ± 0.02   -1.2     -8.6 ± 0.3     -6.9
mbondi    -4.44 ± 0.08   -2.7     -1.9 ± 0.2    -0.2     0.1 ± 0.2      1.9      -4.88 ± 0.07   -3.1
mbondi2   -6.6 ± 0.2     -4.8     -5.1 ± 0.2    -3.5     -3.1 ± 0.1     -1.3     -8.7 ± 0.2     -6.9

The same pattern observed in AMBER was shown with OpenMM, where IGB=1 (equivalent to HCT) returned an energy evaluation equal to -3.85 kcal/mol, IGB=2 (equivalent to OBC1) returned -1.15 kcal/mol, and IGB=5 returned 0.80 kcal/mol. Overall, these results indicate the problem with DMP is not due to a specific implementation of the GB model but seems to be a general feature of the selected GB model.

We also ran the GBHCT model, the approach of Hawkins, Cramer, and Truhlar [173, 165], in GROMACS and the result is in Table 3.2. The GBOBC method aims to return a closer approximation of the true molecular volume than GBHCT [165]. This is because the effective Born radii in the GBOBC model are rescaled to account for the interstitial spaces between atom spheres missed by the GBHCT approximation. However, in contrast to what was observed with the GBOBC model, we saw a favorable hydration free energy for DMP using GBHCT (-0.529 ± 0.001 kcal/mol), suggesting the largest problem in this case was with the OBC model.
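To make the role of the effective Born radii concrete, the following minimal sketch evaluates the standard generalized Born polarization energy with the pairwise f_GB function of Still and co-workers. It is an illustration only, not the GROMACS, AMBER, or OpenMM implementation: the function name, the use of Å and kcal/mol, and the toy charges and radii are assumptions, and the effective Born radii are supplied by hand rather than computed by an HCT or OBC scheme.

```python
import math

COULOMB_KCAL = 332.0637  # Coulomb constant in kcal*Å/(mol*e^2)


def gb_polarization_energy(charges, coords, born_radii, eps_in=1.0, eps_out=78.5):
    """Generalized Born polarization energy in kcal/mol (Still f_GB form).

    charges    : partial charges in units of e
    coords     : list of (x, y, z) positions in Å
    born_radii : effective Born radii in Å (assumed given, not computed here)
    """
    prefactor = -0.5 * COULOMB_KCAL * (1.0 / eps_in - 1.0 / eps_out)
    total = 0.0
    n = len(charges)
    for i in range(n):
        for j in range(n):  # the double sum includes the i == j self terms
            r2 = sum((a - b) ** 2 for a, b in zip(coords[i], coords[j]))
            rr = born_radii[i] * born_radii[j]
            f_gb = math.sqrt(r2 + rr * math.exp(-r2 / (4.0 * rr)))
            total += charges[i] * charges[j] / f_gb
    return prefactor * total


if __name__ == "__main__":
    # Hypothetical two-site model: only the effective Born radius of site 0 is varied.
    q = [-0.4, 0.4]
    xyz = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
    for radius in (1.5, 2.0, 2.5):
        energy = gb_polarization_energy(q, xyz, [radius, 1.2])
        print(f"R0 = {radius:.1f} Å  G_pol = {energy:.3f} kcal/mol")
```

For a single charge the double sum reduces to the Born expression for an ion, which is a convenient sanity check. In this functional form every pair term depends on the effective radii, so any error in how those radii are obtained propagates directly into the electrostatic term.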

Finally, we tested the GBNSR6 model [175, 176]. GBNSR6 is a different kind of GB model, where the effective Born radii Ri are computed in a different way (or flavor), increasing the accuracy and efficiency of the model [175]. With this flavor, the effective Born radii are calculated numerically as a single $|\mathbf{r} - \mathbf{r}_i|^{-6}$ integral over the Lee–Richards molecular surface [175]. The accuracy of GBNSR6 relative to the PB standard is unrelated to the choice of input atomic radii [175, 176], rather than selecting different approximations depending on the size of the radius as in most other GB models. This means that, effectively, GBNSR6 has fewer parameters and is closer to the underlying PB model being approximated. Previous work has argued this difference is extremely important for obtaining accurate effective Born radii and accurate hydration free energies; while we have not examined that exhaustively, it certainly performs better for us here than the other GB models explored [176].

Running the GBNSR6 model, DMP showed a hydration free energy of -5.54 kcal/mol, which is closer to the experimental value than the other implicit solvent models tested in this study. The electrostatic term was -7.16 kcal/mol. This reinforces the hypothesis that the unfavorable hydration free energy of 1,4-dimethylpiperazine comes from an inaccuracy in the calculation of the effective Born radii.

Table 3.2: Comparison of results with experimental hydration free energies using the full free energy scheme (∆G) and several different implicit solvation models (kcal/mol).

Molecule                           Expt.   Explicit        GBOBC              SEA                GBNSR6
1,4-dimethylpiperazine             -7.6    -7.87 ± 0.03    3.440 ± 0.001      -3.638 ± 0.01      -5.54
1-methylpiperidine                 -3.9    -3.47 ± 0.03    2.189 ± 0.002      -0.961 ± 0.006     -2.4
(1R,2S)-1,2-dimethylcyclohexane    1.6     1.69 ± 0.03     1.4723 ± 0.0007    2.419 ± 0.002      1.6
(1r,4r)-1,4-dimethylcyclohexane    2.1     1.92 ± 0.03     1.6572 ± 0.0007    2.569 ± 0.002      1.7
triethylamine                      -3.2    -1.96 ± 0.03    3.274 ± 0.002      0.60 ± 0.02        -1.3
3,3-dimethylpentane                2.6     2.59 ± 0.03     1.7550 ± 0.0008    2.649 ± 0.002      1.6
2,2,5-trimethylhexane              2.9     2.97 ± 0.03     2.0842 ± 0.0009    3.100 ± 0.002      2.0
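As a quick way to compare the models in Table 3.2, the sketch below computes the root-mean-square error and mean signed error of each method against experiment, using only the seven rows of the table (central values, uncertainties ignored). These statistics are an illustration on this small subset and are not the RMS values reported for the full FreeSolv set; the variable and function names are assumptions.

```python
import math

# Central values from Table 3.2 (kcal/mol), in row order.
expt = [-7.6, -3.9, 1.6, 2.1, -3.2, 2.6, 2.9]
models = {
    "Explicit": [-7.87, -3.47, 1.69, 1.92, -1.96, 2.59, 2.97],
    "GBOBC": [3.440, 2.189, 1.4723, 1.6572, 3.274, 1.7550, 2.0842],
    "SEA": [-3.638, -0.961, 2.419, 2.569, 0.60, 2.649, 3.100],
    "GBNSR6": [-5.54, -2.4, 1.6, 1.7, -1.3, 1.6, 2.0],
}


def rmse(pred, ref):
    """Root-mean-square error between predictions and reference values."""
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(pred, ref)) / len(ref))


def mean_signed_error(pred, ref):
    """Mean of (prediction - reference); positive means too unfavorable."""
    return sum(p - r for p, r in zip(pred, ref)) / len(ref)


if __name__ == "__main__":
    for name, values in models.items():
        print(f"{name:9s} RMSE = {rmse(values, expt):.2f}  "
              f"MSE = {mean_signed_error(values, expt):+.2f} kcal/mol")
```

On these seven molecules the ordering of the errors follows the discussion in the text: GBOBC deviates most from experiment, SEA reprocessing reduces the error, and GBNSR6 comes closest among the implicit approaches.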

3.4.3 Comparing series of molecule types

We also compared molecules that share similarities with DMP in order to understand why the implicit solvent model does not work properly for those molecules. For example, we wanted to observe how removing one of the methyl groups from DMP, turning it into 1-methylpiperazine, affects its electrostatic term and, consequently, the hydration free energy. We therefore obtained the hydration free energies and the GBOBC Gel values of six other molecules for comparison.

All molecules containing a methyl group on their nitrogen atoms showed positive Gel values for GBOBC, as shown in Figure 3.6 a. 1,4-Dimethylpiperazine, 1-methylpiperazine, 1-methylpiperidine, and 1-methylpyrrolidine presented a positive implicit solvent hydration free energy using the GBOBC model (Figure 3.6 b). For all seven molecules in Figure 3.6, the SEA and the new GBNSR6 model improved the hydration free energy values, with GBNSR6 being the best model for this series of molecules.

When analyzing this series of molecule types, we could see the difference that the methyl group makes in their implicit solvent hydration free energy values (Figure 3.6). 1-Methylpyrrolidine showed a positive Gel for the GBOBC implicit solvent method and a positive hydration free energy value. For pyrrolidine, a 5-membered ring that does not have a methyl group attached to the nitrogen atom, we observe a negative Gel and a negative implicit solvent hydration free energy. Molecules with methyl groups attached to their heterocyclic rings are likely affected by the lack of precision of the Born radii calculation on those atoms. Losing the methyl group to make a secondary amine out of the tertiary amine plays an important role in the implicit solvent hydration free energy calculation on these ring systems.

Implicit solvent hydration free energy calculations are still a challenge for these types of molecules. Although the GBNSR6 model presented better results than GBOBC and SEA, the results from GBNSR6 are considerably different from the explicit solvent model values for those molecules. Explicit solvent hydration calculations have detailed physical representations of the water, and they can capture higher order entropic effects that are difficult or impossible for implicit solvent models to capture. In order to improve the predictions of implicit solvent models, the GB radii would need to be modified in a way that captures the effects seen in those molecules.

Figure 3.6: Hydration free energies for a series of molecule types. a) Gel from the GBOBC model in kcal/mol. b) Explicit solvent hydration (beige), implicit solvent (dark red), SEA (light lilac), and GBNSR6 (sand) hydration free energies for seven molecules. The electrostatic term for 1-methylpyrrolidine is 0.065 ± 0.005. No explicit solvent hydration free energy is shown for 1-methylpyrrolidine as it is not part of the FreeSolv database.

3.5 Conclusions

Hydration free energies are important in molecular simulations. Here, we present a new methodology using a hybrid method, where we first compute the hydration free energies of compounds in the FreeSolv database using an implicit solvent model and reprocess them using the semi-explicit assembly (SEA) model. Our method showed improved accuracy compared to hydration free energies computed with the GB model and showed relatively good agreement with the explicit solvent model from previous work [1] with RMS equal to 1.08 kcal/mol for the hybrid model and 2.00 kcal/mol for the GBOBC model.

We also found some molecules which posed particular challenges for the GBOBC model, even identifying some molecules which yielded positive values for the electrostatic component of solvation. We then explored the effective Born radii and other implicit solvent methods and implementations to understand the behavior. We found that the problem with the GBOBC model is not a software-specific problem, and that the same problem occurs in several simulation packages. Also, a series of molecule types (heterocyclic molecules containing nitrogen) showed the effects of losing the methyl group to make a secondary amine out of the tertiary amine in their Gel and hydration free energies.

Ultimately, we found that switching to an implicit solvent model with putatively more accurate effective Born radii (the GBNSR6 model [145, 175]) improved the calculated hydration free energies and gave results in better agreement with those from explicit solvent methods, suggesting that the way effective Born radii are calculated can impact the accuracy of the results.

Overall, our work highlights a small and interesting set of challenging molecules which need to be taken into consideration when developing new implicit solvent models. We found seven molecules which have unphysical positive values of the electrostatic component of their hydration free energies. These seven have structures that occur frequently in biologically and pharmaceutically relevant molecules. Delivering implicit solvent models that are computationally fast and accurate can contribute to effective results in the early stages of drug discovery. However, our work here shows that implicit solvent models still need considerable improvement on challenging test cases like these, and a better treatment of effective Born radii can potentially help resolve these issues, at least for these specific molecules.

3.6 Acknowledgement

DLM and CZ appreciate financial support from the National Institutes of Health (1R15GM096257-01A1, 1R01GM108889-01) and the National Science Foundation (CHE 1352608), and computing support from the UCI GreenPlanet cluster, supported in part by NSF Grant CHE-0840513. CZ also appreciates the Brazilian Science without Borders scholarship (Capes – BEX 1865612-9). We also appreciate helpful discussions with Alexey Onufriev and Saeed Izadi (Virginia Tech), and his suggestion of looking at the GBNSR6 solvent model.

Chapter 4

Predicting conformational poses of the lissoclimides for translation inhibition using in silico docking and in cerebro analysis

4.1 Abstract

The lissoclimides are reported to be potent cytotoxins. In particular, chlorolissoclimide (CL) has been shown to be a potent inhibitor of eukaryotic translation. Natural product inhibitors of eukaryotic protein synthesis have significant therapeutic potential to treat several human cancers. The Vanderwal lab at UCI performed a synthesis-driven study, where they synthesized several lissoclimides in order to learn about their cytotoxicity, activity, and the structural basis for translation inhibition. The Mobley group was part of this investigation, where we used computational tools to guide the study and understand the structure-activity relationship (SAR) among the lissoclimide family. This study led to two scientific papers published in high impact journals where more information can be found [177, 178].

4.2 Introduction

The ribosome is a molecular machine responsible for protein synthesis, which is critical to sustaining life [179, 180]. Given its importance in the cell, protein synthesis has been the target of some natural product inhibitors. In prokaryotes, the atomic structures of ribosomes and new antibiotics that inhibit protein synthesis have provided a tool to study this process in bacteria [181, 182].

Cycloheximide (CHX) (Figure 4.1) is the most common laboratory reagent used as protein synthesis inhibitor and is also an inhibitor of the eukaryotic translation [183]. CHX works as a blocker of the elongation phase of eukaryotic translation, binding to the ribosome and inhibiting eEF2-mediated translocation [184].

Figure 4.1: 2D structures of cycloheximide and lactimidomycin, respectively.

CHX was originally isolated from Streptomyces griseus; another product from the same family is lactimidomycin (LTM) (Figure 4.1), isolated from Streptomyces amphibiosporus [185, 186]. Both molecules have a glutarimide moiety in their structures. LTM has shown potent activity against the proliferation of some tumor cell lines, and has a similar mechanism to CHX [186]. Lactimidomycin and cycloheximide bind in the E-site of the 60S ribosome (Figure 4.2) and interfere with the translation process [187]. Nevertheless, the compounds are known to have different mechanisms of action: cycloheximide binds together with a deacylated tRNA to block translation, whereas lactimidomycin blocks tRNA access to the E-site [2]. For both molecules, the interaction between the glutarimide ring and the G92, C93, and U2763 bases seems to be the most important for the inhibition [187].

Figure 4.2: Cycloheximide (magenta) and Lactimidomycin (purple) crystal structures in the binding pocket of the 60S ribosome from S. cerevisiae [2]. Hydrogen bonds are represented in red and black dashed lines for CHX and LTM, respectively.

The Vanderwal group (UC Irvine) has synthesized some molecules with similar structure to CHX and LTM in order to find antiproliferative activity in tumor cells [177]. Some of them such as haterumaimide J, chlorolissoclimide and their analogues have shown potent activity against some tumor cell lines.

In this study, we predicted likely binding modes for several new ligands (Figure 4.4) based on a combined in silico and in cerebro approach. We used docking calculations and shape overlays to help us understand the likely binding modes of some new molecules (Figure 4.4). The X-ray crystallographic structures of lactimidomycin and cycloheximide bound to the Saccharomyces cerevisiae ribosome were determined in a previous study and were used as references in our study [2].

Figure 4.3: 2D structures of the ligands tested in this study and their activities. The molecules whose 50% inhibitory concentrations (IC50) were measured against P388 leukemia cells, and (B) the ones whose activity was not tested.

Figure 4.4: 2D structures of the ligands tested in this study with no activities. These are the molecules whose activity was not measured in P388 leukemia cells.

Regarding the compounds studied in this project, the major difference is the replacement of the glutarimide ring with a succinimide ring. However, the number of hydrogen bond donors and acceptors on the ring was kept the same. Moreover, our compounds have a decalin ring in their structures, unlike cycloheximide and lactimidomycin.

The probability of a given docking method correctly identifying the true ligand binding mode as its top scoring pose is usually low [188, 189]. One of the approaches that has been used to predict binding modes with better results is called consensus scoring [190]. Consensus scoring is a method which uses more than one scoring algorithm to predict the binding mode and rank possible binders. This combination has shown better binding predictions than when a single docking/scoring method is applied alone [190]. Some studies have proposed a similar methodology, called consensus docking, where they compare poses from more than one docking method and reject those that are not sufficiently similar (that is, they keep only poses that agree within RMSD <= 2.0 Å), increasing the likelihood of correctly identifying the binding mode [191].
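The consensus docking criterion described above can be sketched as a pairwise RMSD check between the top poses of different programs. This is a minimal illustration under stated assumptions, not the procedure of reference [191] or of this study: poses are plain lists of heavy-atom coordinates with identical atom ordering, they share the receptor frame (so no superposition is done), molecular symmetry is ignored, and the function names and toy coordinates are hypothetical.

```python
import math
from itertools import combinations


def rmsd(pose_a, pose_b):
    """In-place RMSD (Å) between two poses with matching atom order."""
    if len(pose_a) != len(pose_b):
        raise ValueError("poses must contain the same number of atoms")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(pose_a, pose_b)
    )
    return math.sqrt(squared / len(pose_a))


def poses_agree(top_poses, cutoff=2.0):
    """True if every pair of programs gives top poses within `cutoff` Å."""
    return all(
        rmsd(top_poses[a], top_poses[b]) <= cutoff
        for a, b in combinations(top_poses, 2)
    )


if __name__ == "__main__":
    # Toy three-atom ligand; coordinates are invented for illustration.
    top_poses = {
        "FRED": [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)],
        "HYBRID": [(0.2, 0.1, 0.0), (1.6, 0.2, 0.1), (1.4, 1.6, 0.2)],
        "Vina": [(0.5, 0.3, 0.4), (1.9, 0.4, 0.3), (1.8, 1.9, 0.5)],
    }
    print("consensus reached:", poses_agree(top_poses))
```

A ligand whose top poses fail this check would be rejected, or examined more closely, rather than accepted on the score of any single program.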

4.3 Methods

Inspired by this consensus docking approach, we used FRED, HYBRID (OpenEye OEDocking v.3.0.1) [91] and AutoDock Vina 1.1.2 [192] in order to analyze the best scored poses of each method. Since we do not have reference crystal structures for those molecules, we could not compare their RMSD. The shape similarity was also calculated using ROCS (OpenEye Scientific Software, Santa Fe, NM) [193] by comparing several conformations of the new ligands with the known ligand chlorolissoclimide. Thus, we assessed pose quality using an in cerebro approach, using crystallographically known ligands (lactimidomycin, cycloheximide, and chlorolissoclimide) as pose references.

4.3.1 We first prepared the structures for docking calculations

Three-dimensional (3D) chemical structures of haterumaimide J, chlorolissoclimide, and their respective hybrid analogues were drawn using Maestro Elements 1.9.013 (Schrödinger Software) [194]. The crystal structures of the ribosome in complex with lactimidomycin and cycloheximide [2], retrieved from the Protein Data Bank (PDB entries: 4U4R and 4U3U, respectively), were employed as templates.

4.3.2 Different docking methods were tested

Docking was performed using two different software packages, OpenEye OEDocking v3.0.1 and AutoDock Vina 1.1.2 [192]. For the first package, we used the FRED (Chemgauss3 scoring) [195] and HYBRID (Chemgauss and Chemical Overlay scoring) methods [196, 2].

For the OEDocking software, we wrote a script to perform all tasks starting from .pdb files of our ligands and crystal structures. Ligand conformer libraries were created in Omega 2.3.2 (OpenEye Scientific Software) [91]. Partial charges were assigned according to the Amber library (AM1-BCC). The ribosome structures (PDB codes: 4U4R and 4U3U) were used as the receptors. Then, the receptor and ligand were prepared by adding hydrogen atoms and assigning partial charges using the force field. The ligands located in the active site (lactimidomycin and cycloheximide) were kept for the HYBRID method. For each compound, we performed HYBRID docking using both available PDB structures. The grid box was 30 x 30 x 30 Å, placed in the binding site using the crystallographic structures as reference. No solvent was used.

In the second part, the protein receptors and their ligands were pre-processed with AutoDockTools 1.5.6 [192]. The receptor was kept rigid (PDB entry: 4U4R), while the maximal number of rotatable bonds was allowed for the ligands. All metal ions were manually removed. Docking simulations were performed with AutoDock Vina 1.1.2 [192] using genetic algorithm with default parameters. The grid box was 20 x 20 x 20 Å, centered on the binding pocket using the crystallographic structures as reference (PDB entry: 4U4R).
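One simple way to center a docking box on the binding pocket is to use the centroid of the co-crystallized reference ligand. The sketch below does this and writes the corresponding AutoDock Vina configuration text with a 20 x 20 x 20 Å box. It is an illustration under assumptions: the thesis does not state how the center was computed, the file names and coordinates are hypothetical, and only the standard Vina configuration keys (receptor, ligand, center_x/y/z, size_x/y/z) are used.

```python
def box_center(coords):
    """Geometric center of a list of (x, y, z) coordinates in Å."""
    n = len(coords)
    return tuple(sum(atom[k] for atom in coords) / n for k in range(3))


def vina_config(receptor, ligand, center, size=(20.0, 20.0, 20.0)):
    """Return the text of a Vina configuration file for the given search box."""
    lines = [f"receptor = {receptor}", f"ligand = {ligand}"]
    for axis, value in zip("xyz", center):
        lines.append(f"center_{axis} = {value:.3f}")
    for axis, value in zip("xyz", size):
        lines.append(f"size_{axis} = {value:.3f}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical heavy-atom coordinates of the reference ligand (Å).
    reference_ligand = [(10.0, 22.0, 31.0), (11.2, 23.1, 30.4), (12.1, 21.8, 32.0)]
    center = box_center(reference_ligand)
    # File names are placeholders, not the files used in this study.
    print(vina_config("receptor_4U4R.pdbqt", "ligand.pdbqt", center))
```

The printed text can be saved to a file and passed to Vina with its `--config` option; a larger `size` would be needed for ligands that extend well beyond the reference.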

4.3.3 We calculated the shape-similarity of the ligands

The shape-similarity coefficients were obtained using the Tanimoto formula in ROCS 3.2.0.4 (Rapid Overlay of Chemical Structures) [193, 197]. The algorithm used here is based on the comparison of the three-dimensional shapes of molecules. The scored conformations from the OEDocking HYBRID method were used to perform the shape overlay.

Tanimoto scores run from 0 to 1, where 1 means the molecule is identical to the reference and 0 means totally dissimilar. We also scored via a combined shape and chemical similarity called Combo. Combo is the sum of Shape Tanimoto and Color Tanimoto, presenting values between 0 and 2 [198, 199].
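The scores described above follow the usual overlap-based Tanimoto definition, T = O_AB / (O_AA + O_BB - O_AB), where O_AB is the overlap between query and candidate and O_AA, O_BB are the self-overlaps. The sketch below is only an arithmetic illustration with hypothetical overlap values; it does not call ROCS, and the function names are assumptions.

```python
def tanimoto(overlap_ab, self_a, self_b):
    """Tanimoto coefficient from an overlap and the two self-overlaps."""
    return overlap_ab / (self_a + self_b - overlap_ab)


def tanimoto_combo(shape_tanimoto, color_tanimoto):
    """Combo score: sum of shape and color Tanimoto, ranging from 0 to 2."""
    return shape_tanimoto + color_tanimoto


if __name__ == "__main__":
    # Hypothetical overlap volumes (arbitrary units) for a query and a candidate.
    shape = tanimoto(overlap_ab=410.0, self_a=450.0, self_b=430.0)
    print(f"shape Tanimoto = {shape:.3f}")

    # Identical molecules overlap perfectly, giving a Tanimoto of 1.
    print(f"self comparison = {tanimoto(450.0, 450.0, 450.0):.1f}")

    # Combining a shape and a color score as reported for compound 38 in Table 4.1.
    print(f"combo = {tanimoto_combo(0.936, 0.700):.3f}")
```

Because the denominator subtracts the shared overlap, the score penalizes both volume that the candidate lacks and volume that it adds relative to the query.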

The goal of this method is to compare the shape adopted by the molecules in comparison to the query, finding similarity and possible “scaffold hopping”.

4.3.4 The new chlorolissoclimide structure influenced the later part of our study

The docking and shape overlay calculations had two stages, one before and the other after the X-ray crystallography of chlorolissoclimide described above. In the first stage, we only tested four compounds (haterumaimide J, chlorolissoclimide (3), 38, and 39). The results of the first part gave us confidence to proceed with this technique after we obtained the crystal structure of chlorolissoclimide bound to the S. cerevisiae 80S ribosome and compared the binding modes we had predicted to be most likely for the four ligands. We found that the predicted binding modes agreed well with the chlorolissoclimide crystal structure. Thus, going forward, we continued applying the same methodology, but we added chlorolissoclimide as a reference structure to the HYBRID method and overlaid the generated conformations onto the crystal structure of chlorolissoclimide as well.

4.3.5 Analysis

In this study, we provided poses for several new ligands based on a combined in silico and in cerebro approach. Consensus docking often does better at binding mode prediction than individual docking tools [188, 191, 200]. Thus, we used this methodology where we compared the poses generated from different docking methods to the known crystal structures, using an in cerebro approach, to predict the most likely binding pose of each ligand [201, 202, 203]. Shape overlays were used as an auxiliary tool to assist the in cerebro part of the analysis.

4.4 Results and Discussion

The docking and shape overlay calculations had two stages, before and after the X-ray crystal structure of chlorolissoclimide was obtained by collaborators. In the first stage we only tested four compounds (hatJ, chlorolissoclimide, and two analogues in Figure 4.5). The results of the first part gave us confidence to proceed with this technique after we got the crystal structure of chlorolissoclimide bound to the S. cerevisiae 80S ribosome and compared the binding modes we had predicted to be most likely for the four ligands. We found that the predicted binding modes agreed rather well with the chlorolissoclimide crystal structure (Figure 4.6). Thus, going forward, we continued applying the same methodology, but we added chlorolissoclimide as a reference structure to the HYBRID method and overlaid the generated conformations onto the crystal structure of chlorolissoclimide as well. In this second part, we tested the remaining ligands.

Figure 4.5: Most likely binding mode for (A) chlorolissoclimide, (B) its analogue, (C) haterumaimide J, and (D) its analogue.

4.4.1 Tanimoto score

We used the Tanimoto score to assess how similar the conformations of the molecules are to our references; the higher the score, the higher the similarity. The compounds showed high shape similarity with chlorolissoclimide, which means they can likely adopt very similar binding modes (Table 4.1). We obtained the most likely binding poses for the compounds using a consensus of results among different docking methods and via shape overlay analysis.

Figure 4.6: Chlorolissoclimide crystal structure in the binding site of the S. cerevisiae 80S ribosome. Hydrogen bonds are shown as black dashed lines.

4.4.2 Structure-Activity Relationship

The results showed that the interaction of the C7 hydroxyl group with the Pro-56 residue is very important to the activity, because the other ligands, which are less active, did not show this key interaction. We observed the C18 hydroxyl group of compound 37 interacting with the G2793 nucleotide (Figure 4.8). However, the geometry of this interaction in compound 37 is not right for a good hydrogen bond, explaining why its activity is close to that of compound 38 (Figure 4.7), which does not have this group.

Compound 39 has the succinimide ring and the C12 hydroxyl group directed back from the plane. Because of the difference in stereochemistry, the C12 hydroxyl group of compound 39 was positioned in the opposite direction in the binding pose compared to chlorolissoclimide. This C12 hydroxyl group of compound 39 showed a hydrogen bond with unusual geometry to a nitrogen of the C2764 nucleotide.

Figure 4.7: Most likely binding poses for compounds (A) 37, (B) 38, (C) 39, (D) 41, (E) 40, (F) hatN, (G) 3168, (H) hatQ, (J) 393, (K) 45, (L) dichlorolissoclimide, (M) hatJ, (N) anti1, (O) anti2, and (P) C3AC in the binding pocket of the ribosome. Hydrogen bonds are represented as black dashed lines.

Table 4.1: Shape overlay results using ROCS Software

Reference            Molecule               ROCS Tanimoto Combo   ROCS Shape Tanimoto   ROCS Color Tanimoto
Lactimidomycin       Haterumaimide          0.659                 0.451                 0.207
Lactimidomycin       Analogue H             0.664                 0.394                 0.230
Lactimidomycin       Chloroliss             0.667                 0.369                 0.299
Lactimidomycin       Analogue C             0.650                 0.432                 0.218
Cycloheximide        Haterumaimide          0.972                 0.665                 0.307
Cycloheximide        Analogue H             0.933                 0.603                 0.330
Cycloheximide        Chloroliss             0.917                 0.604                 0.313
Cycloheximide        Analogue C             0.887                 0.618                 0.269
Chlorolissoclimide   37                     1.446                 0.910                 0.536
Chlorolissoclimide   38                     1.635                 0.936                 0.700
Chlorolissoclimide   39                     1.45                  0.807                 0.411
Chlorolissoclimide   41                     1.508                 0.681                 0.872
Chlorolissoclimide   40                     1.522                 0.922                 0.600
Chlorolissoclimide   hatQ                   1.780                 0.952                 0.828
Chlorolissoclimide   3168                   1.650                 0.950                 0.700
Chlorolissoclimide   hatN                   1.561                 0.881                 0.681
Chlorolissoclimide   393                    1.443                 0.906                 0.537
Chlorolissoclimide   45                     1.638                 0.929                 0.709
Chlorolissoclimide   Dichlorolissoclimide   1.779                 0.950                 0.830
Chlorolissoclimide   Haterumaimide J        1.469                 0.919                 0.549
Chlorolissoclimide   Anti-hat 1             1.488                 0.919                 0.569
Chlorolissoclimide   Anti-hat 2             1.348                 0.879                 0.468
Chlorolissoclimide   C3AC                   1.489                 0.832                 0.656

The carbon-carbon double bond in the decalin ring also affects the interactions of 41 with the ribosome. Compound 41 did not make an interaction between its C12 hydroxyl group and the ribosome, resulting in low activity. Regarding compound 40, the C3 hydroxyl group in the decalin ring, which is directed in front of the plane, was shown to be responsible for the decrease in activity. According to the docking results, the C3 hydroxyl group is placed between two nucleotides (G2793 and G2794), where the chlorine atom usually sits in other compounds. This candidate binding mode appears to make favorable interactions with the ribosome, so the reason for its poor biological activity remains somewhat unclear. Perhaps there is an unfavorable electrostatic interaction between the C3 hydroxyl group and the G2793 and G2794 nucleotides, or perhaps the biological activity data is due to other problems with the compound aside from its binding interactions, such as cell permeability.

HatN (Figure 4.7) is very similar to CL, except for having an acetoxy group at C7 instead of a hydroxyl group. Unlike the hydroxyl group, which interacts with Pro-56, the acetoxy group did not show any interactions with the ribosome in the most likely binding pose. This group also affected the C12 hydroxyl group, which showed no interaction with A2802.

The most likely binding pose of compound 3168 (Figure 4.7) showed the expected interactions involving the succinimide ring and the C12 hydroxyl group. HatQ is very similar to CL; the only difference is that hatQ does not have the chlorine atom in its structure. The chlorine atom seems to make a difference to the interaction of the C7 hydroxyl group, where we could see a hydrogen bond to Phe-58 instead of Pro-56 as we saw with CL. This can explain why the hatQ activity decreased.

Compound 393 has two chlorine atoms and a C18 hydroxyl group, which did not show any significant interaction with the ribosome. Compound 45 is almost the same as compound 393; it just does not have the C18 hydroxyl group in its structure. The results of our docking calculations could not explain why these two molecules differ so much in activity.

Using our methodology, we observed that many analogues require conformational changes or a translation of the ligand to bind reasonably, depending on their new functionality. The chlorolissoclimide analogues 39, 45, and 37 are good examples of those different functionalities and changes of conformation, as shown in Figure 4.8. A specific example is the B ring of 45, which adopts a twist-boat conformation in which the then pseudo-equatorial chlorides are poised for two halogen–π interactions on the facing guanines (G2793/4), as shown in Figure 4.9.

Figure 4.8: Ligands bound to the 80S ribosome. (a) Co-crystal structure of CL, (b) docking-based structure of analogue 39, (c) docking-based structure of analogue 45, (d) docking-based structure of analogue 37.

4.4.3 Molecules without IC50.

Regarding the molecules which were not tested in cells, dichlorolissoclimide showed some apparently unfavorable interactions with the ribosome. The C12 hydroxyl group did not present interactions with the A3802 nucleotide, which seems to be important for the activity. The C7 hydroxyl group interacted with Phe-58, similar to hatQ. HatJ has only one chlorine and a hydroxyl group on C18, so, according to our predicted binding mode, it should likely have decent to good activity.

The result showed the C12 hydroxyl group interacting with the ribosome, the same as seen with CL, and also the C18 hydroxyl interacting with the G2793 nucleotide. Anti-hat 1 showed all the key interactions with the ribosome, but its orientation inside the binding pocket is different compared to the other structures. On the other hand, anti-hat 2 adopted a similar pose to dichlorolissoclimide. The last molecule tested was C3AC. Its most likely binding pose showed the contacts between the ribosome and the succinimide ring and C12 hydroxyl group, but the ester group located at C3 did not make any additional interactions with the ribosome.

Figure 4.9: Compound 45 bound to the 80S ribosome. The two halogen–π interactions on the facing guanines (G2793/4) are in red dashed lines. The hydrogen bonds are in black dashed lines.

4.4.4 Docking versus Crystal Structure

After our computational study was done, we obtained the crystal structure of chlorolissoclimide bound to the yeast 80S ribosome. Figure 4.10 shows the docking-predicted binding mode for chlorolissoclimide (green) and its analogue (orange) compared to the crystal structure (magenta). The RMSDs for our docking predictions were 1.6 Å and 1.2 Å for chlorolissoclimide and its analogue, respectively. Both binding poses predicted by our method showed the chlorine facing a guanine (G2793) in a halogen–π interaction.

Figure 4.10: Comparison of crystal structure and docking pose prediction. Co-crystal chlorolissoclimide (magenta) and the binding mode predictions of (a) chlorolissoclimide (green) and (b) analogue C (orange); their RMSDs are 1.6 Å and 1.2 Å, respectively. The hydrogen bonds between the co-crystal CL and the ribosome are shown in black dashed lines.

Our results suggest that there are two key interactions between the compounds and the binding site which explain most of the structure-activity relationship. Specifically, the hydrogen bond between the C7 hydroxyl group and Pro56 appears to be the key, and the hydrogen bond between the hydroxyl along the C12 bond and the receptor also seems very important.

4.5 Conclusion

Our computational approach involved several docking calculations and shape overlays in order to understand the SAR of the compounds. The lissoclimide analogues tested in this study presented predicted binding modes very similar to those of CHX and CL. There are two key interactions between the compounds and the binding site related to the SAR. Specifically, the hydrogen bond between the C7 hydroxyl group and Pro56 appears to be the key, and the hydrogen bond between the hydroxyl along the C12 bond and the receptor also seems very important. We found that the C18 hydroxyl group does not make any strong hydrogen-bonding interactions. C18 oxygenation has little effect on potency. One interesting finding was that the B ring of 45 adopts a twist-boat conformation in which the then pseudo-equatorial chlorides are poised for two halogen–π interactions on the facing guanines (G2793/4).

Docking is not generally considered a reliable tool for binding mode prediction. However, our in silico docking study and in cerebro analysis proved to be useful for understanding some particular key interactions in the binding mode. Surprisingly, for one of the compounds whose co-crystal structure we later obtained, our predicted binding mode was extremely similar to it, giving us confidence in our methodology. Thus, the computational approach can be a reliable tool for SAR studies and could permit a greater understanding of lissoclimide analogues.

Chapter 5

Progress toward the comprehensive computational design of a macrocyclic peptide

5.1 Abstract

The demand for new antibiotic drugs has increased recently, especially with the spread of multi-drug-resistant bacteria. Facing this problem, new therapeutic possibilities are being considered. Antimicrobial peptides provide one such possibility. A macrocyclic peptide that mimics the CdiA-CTEC869 beta-hairpin was developed in previous studies. The macrocyclic peptide blocks the toxin-immunity network encoded by bacterial contact-dependent growth inhibition (CDI) systems. Unfortunately, this peptide binds only weakly, so here we seek to understand and redesign the macrocyclic peptide. Molecular Dynamics (MD) simulations and analysis of ligand-protein interactions were used to understand the behavior of the peptide. The secondary structure study showed that the beta-sheet structure is mostly kept during the production run, corresponding to 39% of the average secondary structure content of the peptide. The toxin structure showed more hydrogen bond contacts than the original macrocyclic peptide. The toxin-protein interactions occur not only between residues in the beta-sheet and the CdiI, but also in other regions of the toxin. The next step was to replace each residue of the peptide with alanine, one at a time. We found some residues with no hydrogen bonding interactions with the protein and others with very important hydrogen bond interactions. In order to improve some peptide-protein interactions, several mutations in the original macrocyclic peptide were made according to our results. The findings presented here may help to develop potential antibiotic candidates to be tested in cells.

5.2 Introduction

The demand for new antibiotic drugs has increased recently, especially with the spread of multi-drug-resistant bacteria [204]. Facing this problem, new therapeutic possibilities are being considered. Antimicrobial peptides provide one such possibility.

Contact-dependent growth inhibition (CDI) is a system used by bacteria to block the growth of neighboring cells, while the inhibiting cells protect themselves from auto-inhibition by producing specific immunity proteins [205]. This strategy was recently discovered in Gram-negative bacteria and requires cell-to-cell contact mediated by the two-partner secretion system formed by CdiA (a large exoprotein) and CdiB (a facilitator of CdiA transport) [205]. The C-terminal region of CdiA contains the CDI toxin (CdiA-CT), which comprises 250-300 residues. More than 60 families of CdiA-CTs have been observed, and different CdiA-CTs exhibit toxic activities such as tRNase and DNase activity [206, 207, 208]. In order to prevent auto-inhibition, CDI+ cells have an immunity mechanism in which CdiI proteins bind to CdiA-CT and block its activity [206, 207]. Importantly, this system is highly specific for its cognate pair, so CdiI proteins can only block the CdiA-CT of their own species [206]. However, some evidence of weak interactions between different CdiA-CT and CdiI proteins from the same family has been observed [207].

Celia Goulding’s group has studied this particular system in E. coli EC869 (EC869) and Burkholderia pseudomallei 1026b (Bp1026b) and determined the X-ray crystal structures of their CdiA-CT/CdiI complexes (Figure 5.1) [208]. They compared the structures of the CdiA-CT toxin domains from both species and found approximately 15 % sequence identity. Nevertheless, the superimposed 3D structures showed an RMSD of 3.9 Å and a structural similarity to type IIS restriction endonucleases. One of the differences between these two toxins is the two additional β-strands (β4 and β5) in the EC869 toxin, which form a salient β-hairpin that extends into CdiIEC869, producing a six-stranded antiparallel β-sheet. In contrast, the Bp1026b toxin structure does not have any β-hairpin [208].

Figure 5.1: Crystal structure of CdiA-CT (green) and CdiI (cyan).

Figure 5.2: Crystal structure of the macrocyclic peptide [3].

According to some studies, the residues in the toxin β-hairpin have a dominant diversification within the “EC869-like” family, suggesting that this region may be important for CdiA-CT/CdiI binding specificity [209]. With this information, the laboratory of James Nowick (UC Irvine) designed and synthesized a macrocyclic β-peptide mimic (CT-MAC) that has a γ-linked ornithine turn unit (Figure 5.2) [3]. Unfortunately, an in vitro assay of CT-MAC inhibition of the CdiA-CT/CdiI complex (EC869) in Goulding’s lab was inconclusive, and the co-crystal structure showed that the β-strands are shorter than those of CdiA-CT, suggesting that the stability of this β-hairpin might be compromised [3].

Thus, in order to analyze the interactions between CdiA-CT and CdiI and to improve the binding affinity of the macrocyclic peptide, we used computational approaches. Molecular dynamics simulations were used to assess the differences in affinity and to analyze the interactions between the macrocyclic peptide and CdiI, and between CdiA-CT and CdiI.

5.3 Methods

The molecular dynamics simulations were performed using GROMACS v4.6.5 [164]. The CdiA-CT toxin, the CdiI (PDB code: 4G6U) [208], and the macrocyclic peptide structures were prepared using tleap (Amber software) [169], and several GROMACS v4.6.5 tools were used to add solvent to the coordinate files.

We made mutations (Table 5.1) in some residues of the macrocyclic peptide using the PyMOL v1.7 [210] mutagenesis function.

In the first step, the entire system was energy-minimized. In the second step, the system was equilibrated in the NVT ensemble (constant number of atoms, volume, and temperature) for 50 ps. Finally, a 5 ns production run was performed, with snapshots stored every 100 picoseconds, producing 500 trajectory snapshots.
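As a quick check on the output settings, the number of stored snapshots follows from the production length divided by the saving interval. The short Python sketch below uses a hypothetical helper, `n_snapshots`, which is not part of GROMACS, and assumes by default that the frame at time zero is not counted.

```python
def n_snapshots(total_ps: float, interval_ps: float, include_start: bool = False) -> int:
    """Number of frames stored for a run of total_ps saved every interval_ps.

    Hypothetical helper for illustration only; include_start controls
    whether the frame at t = 0 is counted as well.
    """
    n = int(round(total_ps / interval_ps))
    return n + 1 if include_start else n


production_ps = 5_000  # 5 ns production run, as stated in the text

frames_100ps = n_snapshots(production_ps, 100)  # saving every 100 ps
frames_10ps = n_snapshots(production_ps, 10)    # saving every 10 ps
print(frames_100ps, frames_10ps)
```

Under these assumptions, a 5 ns run saved every 100 ps gives 50 snapshots, whereas 500 snapshots would correspond to a 10 ps interval.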

Using GROMACS, we analyzed the simulation data, for example by calculating the number of hydrogen bonds within the CdiA-CT/CdiI complex as a function of time. We also performed DSSP (Dictionary of Secondary Structure of Proteins) analysis to follow the secondary structure of the peptide and check whether the β-hairpin remained stable during the simulation. Finally, we examined the peptide’s root mean square deviation (RMSD) and root mean square fluctuation (RMSF).

5.3.1 Initial structures preparation

The structure of CdiI complexed with the toxin was obtained from the Protein Data Bank (PDB code: 4G6U) [208]. The macrocyclic peptide structure was kindly provided by the Nowick group (University of California, Irvine). All crystallographic water molecules and the yttrium atom were removed. Mutations were made using the Mutagenesis tool of PyMOL v1.7 [210]. Coordinate and topology files were generated with tleap (Amber) using the Amber ff99SB force field [211]. Amber files were converted to GROMACS files using acpype [212]. The structures were placed in dodecahedron boxes using editconf (GROMACS v4.6.5), and TIP3P water was added using genbox (GROMACS v4.6.5) [164].

5.3.2 Molecular dynamics simulations

Molecular dynamics (MD) simulations were carried out using GROMACS v4.6.7 [164]. A 5 ps energy minimization was first performed. The system was then equilibrated under NVT conditions for 50 ps. Finally, a 5 ns production run was carried out.

5.3.3 Analysis

Hydrogen bond analysis. Hydrogen bonds were computed and analyzed with the g_hbond module in GROMACS v4.6.5 [164], using its default settings (cutoffs of 3 Å for distance and 120° for angle). Each residue in the macrocyclic peptide was analyzed separately.
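A geometric hydrogen bond criterion of this kind can be sketched in a few lines of Python. This is a minimal illustration, not the g_hbond implementation: the function name `is_hbond` is hypothetical, the cutoff values are the ones quoted in the text, and the choice to measure the distance between donor and acceptor and the angle at the hydrogen (donor-hydrogen-acceptor) is an assumption, since angle conventions differ between tools.

```python
import math


def _sub(a, b):
    """Component-wise difference a - b of two 3D points."""
    return tuple(x - y for x, y in zip(a, b))


def _norm(v):
    """Euclidean length of a 3D vector."""
    return math.sqrt(sum(x * x for x in v))


def is_hbond(donor, hydrogen, acceptor, max_dist=3.0, min_angle=120.0):
    """Return True if the geometry passes both cutoffs.

    Coordinates are in angstrom. The distance is donor-acceptor and the
    angle is donor-hydrogen-acceptor; both conventions are assumptions.
    """
    dist = _norm(_sub(acceptor, donor))
    v1 = _sub(donor, hydrogen)
    v2 = _sub(acceptor, hydrogen)
    cos_angle = sum(x * y for x, y in zip(v1, v2)) / (_norm(v1) * _norm(v2))
    # Clamp to [-1, 1] to guard against rounding errors before acos.
    angle = math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))
    return dist <= max_dist and angle >= min_angle


# Linear arrangement: distance 2.9 angstrom, angle 180 degrees.
linear = is_hbond((0, 0, 0), (1, 0, 0), (2.9, 0, 0))
# Bent arrangement: distance about 2.7 angstrom, angle 90 degrees.
bent = is_hbond((0, 0, 0), (1, 0, 0), (1, 2.5, 0))
print(linear, bent)
```

Counting the frames in which a given residue satisfies the criterion, and dividing by the number of frames, gives a per-residue average of the kind reported in the results.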

RMSD and RMSF. The wild-type and mutated macrocyclic peptides were also investigated by calculating the root mean square deviation (RMSD) with GROMACS [164]. In addition, to reflect the mobility of individual residues around their mean positions, the root mean square fluctuation (RMSF) was calculated using GROMACS [164].
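The two quantities can be written as RMSD(t) = sqrt((1/N) Σ_i |r_i(t) − r_i,ref|²), summed over the N atoms, and RMSF_i = sqrt((1/T) Σ_t |r_i(t) − ⟨r_i⟩|²), averaged over the T frames. The Python sketch below illustrates these definitions on plain coordinate lists; it is not the GROMACS implementation, the function names are hypothetical, and it assumes that all frames have already been superimposed on a common reference.

```python
import math


def rmsd(frame, reference):
    """RMSD between two conformations given as lists of (x, y, z) tuples."""
    sq = sum(math.dist(a, b) ** 2 for a, b in zip(frame, reference))
    return math.sqrt(sq / len(frame))


def rmsf(trajectory):
    """Per-atom RMSF over a trajectory, given as a list of frames."""
    n_frames = len(trajectory)
    n_atoms = len(trajectory[0])
    result = []
    for i in range(n_atoms):
        # Mean position of atom i over all frames.
        mean = tuple(sum(f[i][k] for f in trajectory) / n_frames for k in range(3))
        # Mean squared fluctuation around that mean position.
        msf = sum(math.dist(f[i], mean) ** 2 for f in trajectory) / n_frames
        result.append(math.sqrt(msf))
    return result


# Toy two-atom, two-frame trajectory (arbitrary units).
toy = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)],
]
print(rmsd(toy[1], toy[0]))
print(rmsf(toy))
```

The RMSD gives one value per frame and describes the drift of the whole structure, whereas the RMSF gives one value per atom (or residue) and describes local mobility.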

Secondary structure analysis. The DSSP (Dictionary of Secondary Structure of Proteins) program was used to calculate the evolution of the secondary structure of the macrocyclic peptide and its mutants over time. We used the GROMACS tool do_dssp, and the resulting file was processed using the xpm2ps tool [164].
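Average secondary structure content can be obtained from per-frame DSSP assignments by counting one-letter codes. The sketch below assumes that the assignments are available as one string per frame with one character per residue, using the DSSP codes E (extended strand), B (isolated β-bridge), T (turn), S (bend), and ~ for coil; the function name `ss_fractions` and the example strings are hypothetical and are not simulation output.

```python
from collections import Counter


def ss_fractions(frames):
    """Fraction of each DSSP code over all residues and all frames.

    frames is a list of strings, one per trajectory frame, with one
    DSSP character per residue.
    """
    counts = Counter("".join(frames))
    total = sum(counts.values())
    return {code: n / total for code, n in counts.items()}


# Made-up assignments for a 13-residue peptide over three frames.
example = [
    "~EEEETTEEEE~~",
    "~EEEESSEEEE~B",
    "~EEE~TTEEEE~~",
]
for code, fraction in sorted(ss_fractions(example).items()):
    print(code, round(fraction, 3))
```

Because every residue receives exactly one code per frame, the fractions computed this way sum to one.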

5.4 Results and Discussion

The secondary structure study showed that the β-sheet structure is mostly kept during the simulation of the original peptide: 38 % of the residues were in β-sheet structure, 29 % were in β-bridge, and 44 % were in coil.

The toxin structure showed more hydrogen bond contacts with the protein than the original macrocyclic peptide had. The toxin-protein interactions occur not only between residues in the β-sheet and the CdiI, but also in other regions of the toxin. Those extra contacts can help to stabilize the toxin-protein complex and increase its binding affinity. Regarding the peptide and protein interaction, we found some residues with no hydrogen bonding interactions with the protein and other residues with very important hydrogen bond interactions.

Table 5.1: List of mutations created from the original macrocyclic peptide. The amino acids mutated from the original peptide are in bold letters.

Mutation  Sequence
mut1      S ORN KEYALAGRELT
mut2      A ORN KEYALSGRELT
mut3      S ORN KEYALSARELT
mut4      S ORN KEYALSGAELT
mut5      S ORN KEYALSGRALT
mut6      S ORN KEYALSGREAT
mut7      S ORN KEYALSGRELA
mut8      S ORN AEYALSGRELT
mut9      S ORN KAYALSGRELT
mut10     S ORN KEAALSGRELT
mut11     S ORN KEYAASGRELT
mut12     S ORN KECALSGRECT
mut13     S ORN KEYALKGRECT
mut14     S ORN KEYALSYRECT
mut15     S ORN KEYAESGRECT
mut16     S ORN KEYAQSGRECT
mut17     S ORN KEYAKSGRECT
mut18     S ORN KEYAQKGRELT
mut19     S ORN KEYAL(D-Ser)GRELT
mut20     CEYALAGRELTC

We decided to start with relatively conservative changes to the peptide, since it is already active but not active enough. The next step was therefore to replace each residue of the peptide with alanine, one at a time. Alanine scanning is a technique that helps to identify key interactions and their importance by measuring how much each substitution affects the peptide-protein complex. The alanine scan can thus provide information to guide new design ideas.
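The set of single-alanine variants can be enumerated programmatically. The Python sketch below is only an illustration: the function name `alanine_scan` is hypothetical, the original sequence is inferred from Table 5.1 by reverting each alanine substitution, and skipping the ornithine turn unit and the existing alanine is an assumption chosen to match the eleven alanine mutants listed in the table.

```python
def alanine_scan(residues, skip=("ALA", "ORN")):
    """Yield (position, mutant) for every single-alanine variant.

    residues is a list of three-letter residue names; positions are
    1-based. Residues named in skip are left unchanged.
    """
    for i, res in enumerate(residues):
        if res in skip:
            continue
        mutant = list(residues)
        mutant[i] = "ALA"
        yield i + 1, mutant


# Original macrocyclic peptide, inferred from Table 5.1 (assumption).
original = ["SER", "ORN", "LYS", "GLU", "TYR", "ALA", "LEU",
            "SER", "GLY", "ARG", "GLU", "LEU", "THR"]

variants = list(alanine_scan(original))
for position, mutant in variants:
    print(position, original[position - 1], "->", "ALA", " ".join(mutant))
```

The enumeration order here follows the sequence and does not reproduce the mut1 to mut11 numbering of Table 5.1.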

In order to improve some peptide-protein interactions, we made some mutations in the original macrocyclic peptide based on our previous analysis. The findings presented here may help to develop potential antibiotic candidates to be tested in cells.

The first approach was to substitute each residue of the original peptide with alanine and check how this impacts the peptide-protein interactions. The average number of hydrogen bonds during the simulation can be seen in Figure 5.3.

Figure 5.3: Average number of hydrogen bonds per residue to CdiI for the alanine mutations (mut1 to mut11). The asterisk (*) marks the position of the alanine mutation in each mutated peptide. Some of the alanine mutations drastically reduced the average number of hydrogen bonds between the peptide and the CdiI protein.

We can see that some residues, such as both leucines (Leu-7 and Leu-12), do not make any hydrogen bond with any other residue, but the same is seen in the toxin. Glutamate (Glu-4) is responsible for important interactions with the protein, which decreased considerably when it was substituted by alanine. Similar behavior was seen for arginine (Arg-10) and glutamate (Glu-11). Some other residues make hydrogen bonds with the protein through the backbone, so the residue type has little effect on the number of hydrogen bonds formed by that residue. Lysine (Lys-3) forms a hydrogen bond between its nitrogen and the protein, so switching this residue to alanine removes that hydrogen bond. The hydrogen bonds of threonine (Thr-13) seem to fluctuate independently of the mutations, suggesting that they probably depend on how the residue or the protein is positioned (maximum of two hydrogen bonds).

Table 5.2: RMSF of the backbone of mut1 to mut11 using GROMACS.

Residue     mut1     mut2     mut3     mut4     mut5     mut6     mut7     mut8     mut9    mut10    mut11
S        -0.0282  -0.0001  -0.0146   0.0039  -0.0862  -0.0116   0.0403   0.0165  -0.0431   0.0219  -0.0252
ORN      -0.0925  -0.0200  -0.0036  -0.0206  -0.0146  -0.0094   0.0491  -0.0031  -0.0524   0.0170  -0.0228
K        -0.1501   0.0197   0.0207  -0.0022   0.0390   0.0153   0.0495   0.0479  -0.0106   0.0177   0.0121
E        -0.0275  -0.0117  -0.0210  -0.0201  -0.0220  -0.0046  -0.0180  -0.0122  -0.0325  -0.0164  -0.0203
Y        -0.0001  -0.0071   0.0092  -0.0261   0.0013  -0.0119  -0.1175  -0.0366  -0.0090   0.0215  -0.0020
A         0.0021  -0.0066   0.0024  -0.0148  -0.0030  -0.0046  -0.0029  -0.0092  -0.0134  -0.0096  -0.0018
L        -0.0016  -0.0310   0.0041  -0.0247   0.0028  -0.0157  -0.0363   0.0021  -0.0259  -0.0423   0.0136
S         0.0038  -0.0171   0.0052  -0.0370  -0.0101  -0.0217  -0.0068  -0.0144  -0.0225  -0.0367  -0.0116
G         0.0013  -0.0056  -0.0037  -0.0333  -0.0040  -0.0041   0.0002  -0.0087  -0.0173  -0.0150  -0.0071
R         0.0004  -0.0057  -0.0038  -0.0238  -0.0008  -0.0033  -0.0016  -0.0023  -0.0183  -0.0067  -0.0017
E         0.0228   0.0065   0.0044   0.0001   0.0315   0.0138   0.0091   0.0143   0.0172   0.0293   0.0051
L         0.0006  -0.0310  -0.0236  -0.0070  -0.0202   0.0414   0.0081   0.0072  -0.0228   0.0158  -0.0007
T        -0.0684   0.0045  -0.0312  -0.0159  -0.0527  -0.0071   0.0155   0.0079  -0.0551  -0.0077  -0.0587

We calculated the root mean square fluctuation (RMSF) of the backbone using GROMACS. The RMSF reflects the mobility of each residue around its mean position. The alanine mutations showed little effect on the mobility of the peptide (Table 5.2). The alanine mutations also did not show high RMSD over time, and both the peptides alone and the peptide-protein complexes remained stable (Figure 5.4a and 5.4b).

(a) RMSD of the ligand (b) RMSD of the ligand-protein complex

Figure 5.4: RMSD of the backbone for the (a) ligand and (b) ligand-protein complex for mutations 1 to 11. All mutated peptides showed a constant RMSD with variations below 0.3 nm. The peptide alone shows more flexibility than when it is bound to the protein.

Regarding the 3D conformation of the peptide, we performed DSSP analysis of secondary structure. The original peptide appeared to stay in the β-sheet conformation for most of the simulation, and the turn formed by residues 9, 10, and 11 was stable during the simulation (Figure 5.5A). When we made a disulfide bridge (mut12), the β-sheet conformation was completely lost (Figure 5.5B). This analysis also reveals that some mutations did not show the bend observed in the original peptide (Figure 5.5C).

For mutations 13 to 20, the average number of hydrogen bonds over time is shown in Figure 5.6. Replacing some of the residues in the turn with larger residues increased the number of interactions between the peptide and the protein. The D-Ser substitution did not improve the hydrogen bond average. The disulfide bridge placed in the middle of the peptide affected the entire secondary structure of the macrocyclic peptide, which lost its β-sheet structure over the course of the simulation.

Our conclusion at this stage is that the residues in the β-hairpin of the original peptide were stable over the length of the simulation (5 ns). We also observed that the residues in the turn make poor contact with the immunity protein. The toxin showed interactions at four residues that are not present in the original peptide, but these residues are not adjacent to the β-hairpin; three of them are located more than 40 residues away from the Lys in the β-hairpin.

Figure 5.5: Secondary structure analysis (DSSP). Secondary structures of the (A) original macrocyclic peptide, (B) mut12, and (C) mut2 over time. The original macrocyclic peptide kept its β-sheet structure considerably stable over time. The mut12 peptide, which has two cysteine mutations to form a disulfide bridge, lost its β-sheet structure over time, and the turn was also compromised. The mut2 peptide is a more conservative mutation, in which a serine residue was switched to an alanine; the β-sheet structure was kept over time, but the peptide lost its bent shape.

Figure 5.6: Average number of hydrogen bonds per residue for mutations mut13 to mut20. The asterisk (*) marks the position of the mutation in each mutated peptide. Some of the mutations drastically reduced the average number of hydrogen bonds between the peptide and the CdiI protein.

5.5 Conclusion

The preliminary data from this study show that the β-hairpin seems stable during the short simulation. This secondary structure is an important characteristic of this particular macrocyclic peptide, since the part of the toxin that binds to the protein has the same structure. Even though longer simulations are required in order to see complex protein movements, mutations at key positions were able to destabilize the secondary structure of the macrocyclic peptide.

Peptide synthesis and laboratory tests need to be performed in order to validate the computational results. More work is therefore needed to select the best candidates, including synthesis of the computationally redesigned macrocyclic peptides and in vitro/in vivo tests. Additional tools, such as Rosetta [https://boinc.bakerlab.org/], could help to refine the peptide design for better binding affinity to the protein.

Bibliography

[1] Guilherme Duarte Ramos Matos, Daisy Y. Kyu, Hannes H. Loeffler, John D. Chodera, Michael R. Shirts, and David L. Mobley. Approaches for Calculating Solvation Free Energies and Enthalpies Demonstrated with an Update of the FreeSolv Database. Journal of Chemical & Engineering Data, 62(5):1559–1569, May 2017.

[2] Nicolas Garreau de Loubresse, Irina Prokhorova, Wolf Holtkamp, Marina V. Rodnina, Gulnara Yusupova, and Marat Yusupov. Structural basis for the inhibition of the eukaryotic ribosome. Nature, 513(7519):517–522, 2014.

[3] Robert P Morse, Julia L E Willett, Parker M Johnson, Jing Zheng, Alfredo Credali, Angelina Iniguez, James S Nowick, Christopher S Hayes, and Celia W Goulding. Diversification of β-Augmentation Interactions between CDI Toxin/Immunity Proteins. Journal of molecular biology, 427(23):3766–84, November 2015.

[4] Daylight Chemical Information Systems Inc. Daylight Theory Manual. Daylight Chemical Information Systems Inc., Daylight Version 4.9, 2011.

[5] Norman L. Allinger, Mary Ann Miller, Libby W. Chow, R. A. Ford, and J. C. Graham. The Calculated Electronic Spectra and Structures of Some Cyclic Conjugated Hydrocarbons. Journal of the American Chemical Society, 87(15):3430–3435, August 1965.

[6] S. Lifson and A. Warshel. Consistent Force Field for Calculations of Conformations, Vibrational Spectra, and Enthalpies of Cycloalkane and n-Alkane Molecules. The Journal of Chemical Physics, 49(11):5116–5129, December 1968.

[7] Steven M Paul, Daniel S Mytelka, Christopher T Dunwiddie, Charles C Persinger, Bernard H Munos, Stacy R Lindborg, and Aaron L Schacht. How to improve R&D productivity: The pharmaceutical industry’s grand challenge. Nature reviews. Drug discovery, 9(3):203–14, March 2010.

[8] Ish Khanna. Drug discovery in pharmaceutical industry: Productivity challenges and trends. Drug Discovery Today, 17(19-20):1088–1102, October 2012.

[9] Gisbert Schneider. Automating drug discovery. Nature Reviews Drug Discovery, 17(2):97–113, February 2018.

[10] Caterina Bissantz, Bernd Kuhn, and Martin Stahl. A medicinal chemist’s guide to molecular interactions. Journal of medicinal chemistry, 53(14):5061–84, July 2010.

[11] John L Klepeis, Kresten Lindorff-Larsen, Ron O Dror, and David E Shaw. Long-timescale molecular dynamics simulations of protein structure and function. Current Opinion in Structural Biology, 19(2):120–127, April 2009.

[12] Benjamin D. Sellers, Natalie C. James, and Alberto Gobbi. A Comparison of Quantum and Molecular Mechanical Methods to Estimate Strain Energy in Druglike Fragments. Journal of Chemical Information and Modeling, 57(6):1265–1275, June 2017.

[13] Nina M. Fischer, Paul J. van Maaren, Jonas C. Ditz, Ahmet Yildirim, and David van der Spoel. Properties of Organic Liquids when Simulated with Long-Range Lennard-Jones Interactions. Journal of Chemical Theory and Computation, 11(7):2938–2944, July 2015.

[14] Clara D. Christ, Alan E. Mark, and Wilfred F. van Gunsteren. Basic ingredients of free energy calculations: A review. Journal of Computational Chemistry, 31(8):1569–1582, 2010.

[15] Andrew Pohorille, Christopher Jarzynski, and Christophe Chipot. Good Practices in Free-Energy Calculations. Journal of Physical Chemistry B, 114(32):10235–10253, January 2010.

[16] Noor Asidah Mohamed, Richard T. Bradshaw, and Jonathan W. Essex. Evaluation of solvation free energies for small molecules with the AMOEBA polarizable force field. Journal of Computational Chemistry, 37(32):2749–2758, December 2016.

[17] Dian Jiao, Pavel A. Golubkov, Thomas A. Darden, and Pengyu Ren. Calculation of protein–ligand binding free energy by using a polarizable potential. Proceedings of the National Academy of Sciences, 105(17):6290–6295, April 2008.

[18] Niels Hansen and Wilfred F. van Gunsteren. Practical Aspects of Free-Energy Calculations: A Review. Journal of Chemical Theory and Computation, 10(7):2632–2647, July 2014.

[19] Sushil K. Mishra, Gaetano Calabró, Hannes H. Loeffler, Julien Michel, and Jaroslav Koča. Evaluation of Selected Classical Force Fields for Alchemical Binding Free Energy Calculations of Protein-Carbohydrate Complexes. Journal of Chemical Theory and Computation, 11(7):3333–3345, July 2015.

[20] David L. Mobley and Michael K. Gilson. Predicting Binding Free Energies: Frontiers and Benchmarks. Annual Review of Biophysics, 46(1):531–558, 2017.

[21] Stefano Piana, Kresten Lindorff-Larsen, and David E. Shaw. How Robust Are Protein Folding Simulations with Respect to Force Field Parameterization? Biophysical Journal, 100(9):L47–L49, May 2011.

[22] Thomas P. Senftle, Sungwook Hong, Md Mahbubul Islam, Sudhir B. Kylasa, Yuanxia Zheng, Yun Kyung Shin, Chad Junkermeier, Roman Engel-Herbert, Michael J. Janik, Hasan Metin Aktulga, Toon Verstraelen, Ananth Grama, and Adri C. T. van Duin. The ReaxFF reactive force-field: Development, applications and future directions. npj Computational Materials, 2:15011, March 2016.

[23] Wilfred F. van Gunsteren and Herman J. C. Berendsen. Computer Simulation of Molecular Dynamics: Methodology, Applications, and Perspectives in Chemistry. Angewandte Chemie International Edition in English, 29(9):992–1023, September 1990.

[24] Archan Dey, Narendra Nath Pati, and Gautam R. Desiraju. Crystal structure prediction with the supramolecular synthon approach: Experimental structures of 2-amino-4-ethylphenol and 3-amino-2-naphthol and comparison with prediction. CrystEngComm, 8(10):751–755, 2006.

[25] Kresten Lindorff-Larsen, Stefano Piana, Kim Palmo, Paul Maragakis, John L. Klepeis, Ron O. Dror, and David E. Shaw. Improved side-chain torsion potentials for the Amber ff99SB protein force field. Proteins: Structure, Function, and Bioinformatics, 78(8):1950–1958, June 2010.

[26] Kenno Vanommeslaeghe, Mingjun Yang, and Alexander D. MacKerell. Robustness in the fitting of molecular mechanics parameters. Journal of Computational Chemistry, 36(14):1083–1101, 2015.

[27] Lingle Wang, Yujie Wu, Yuqing Deng, Byungchan Kim, Levi Pierce, Goran Krilov, Dmitry Lupyan, Shaughnessy Robinson, Markus K. Dahlgren, Jeremy Greenwood, Donna L. Romero, Craig Masse, Jennifer L. Knight, Thomas Steinbrecher, Thijs Beuming, Wolfgang Damm, Ed Harder, Woody Sherman, Mark Brewer, Ron Wester, Mark Murcko, Leah Frye, Ramy Farid, Teng Lin, David L. Mobley, William L. Jorgensen, Bruce J. Berne, Richard A. Friesner, and Robert Abel. Accurate and Reliable Prediction of Relative Ligand Binding Potency in Prospective Drug Discovery by Way of a Modern Free-Energy Calculation Protocol and Force Field. Journal of the American Chemical Society, 137(7):2695–2703, February 2015.

[28] Matteo Aldeghi, Alexander Heifetz, Michael J. Bodkin, Stefan Knapp, and Philip C. Biggin. Accurate calculation of the absolute free energy of binding for drug molecules. Chem. Sci., 7(1):207–218, 2016.

[29] Caitlin C. Bannan, Kalistyn H. Burley, Michael Chiu, Michael R. Shirts, Michael K. Gilson, and David L. Mobley. Blind prediction of cyclohexane–water distribution coefficients from the SAMPL5 challenge. Journal of Computer-Aided Molecular Design, 30(11):1–18, September 2016.

[30] S. Wu, P. Angelikopoulos, C. Papadimitriou, R. Moser, and P. Koumoutsakos. A hierarchical Bayesian framework for force field selection in molecular dynamics simulations. Phil. Trans. R. Soc. A, 374(2060):20150032, February 2016.

[31] Victor M. Anisimov, Igor V. Vorobyov, Benoît Roux, and Alexander D. MacKerell. Polarizable Empirical Force Field for the Primary and Secondary Alcohol Series Based on the Classical Drude Model. Journal of Chemical Theory and Computation, 3(6):1927–1946, November 2007.

[32] Luca Monticelli and D. Peter Tieleman. Force Fields for Classical Molecular Dynamics. In Luca Monticelli and Emppu Salonen, editors, Biomolecular Simulations, number 924 in Methods in Molecular Biology, pages 197–213. Humana Press, January 2013.

[33] Jay W. Ponder, Chuanjie Wu, Pengyu Ren, Vijay S. Pande, John D. Chodera, Michael J. Schnieders, Imran Haque, David L. Mobley, Daniel S. Lambrecht, Robert A. DiStasio, Martin Head-Gordon, Gary N. I. Clark, Margaret E. Johnson, and Teresa Head-Gordon. Current Status of the AMOEBA Polarizable Force Field. The Journal of Physical Chemistry B, 114(8):2549–2564, March 2010.

[34] Lee-Ping Wang, Todd J. Martinez, and Vijay S. Pande. Building Force Fields: An Automatic, Systematic, and Reproducible Approach. The Journal of Physical Chemistry Letters, 5(11):1885–1891, June 2014.

[35] Johnny C. Wu, Gaurav Chattree, and Pengyu Ren. Automation of AMOEBA polarizable force field parameterization for small molecules. Theoretical Chemistry Accounts, 131(3):1138, March 2012.

[36] Thomas A. Halgren. Merck molecular force field. I. Basis, form, scope, parameterization, and performance of MMFF94. Journal of Computational Chemistry, 17(5-6):490–519, April 1996.

[37] Thomas A. Halgren. Merck molecular force field. II. MMFF94 van der Waals and electrostatic parameters for intermolecular interactions. Journal of Computational Chemistry, 17(5-6):520–552, April 1996.

[38] Wolfgang Damm, Antonio Frontera, Julian Tirado-Rives, and William L. Jorgensen. OPLS all-atom force field for carbohydrates. Journal of Computational Chemistry, 18(16):1955–1970, December 1997.

[39] Fabien Cailliez and Pascal Pernot. Statistical approaches to forcefield calibration and prediction uncertainty in molecular simulation. The Journal of Chemical Physics, 134(5):054124, February 2011.

[40] Matthew T. Geballe and J. Peter Guthrie. The SAMPL3 blind prediction challenge: Transfer energy overview. Journal of Computer-Aided Molecular Design, 26(5):489–496, April 2012.

[41] F. Rizzi, H. Najm, B. Debusschere, K. Sargsyan, M. Salloum, H. Adalsteinsson, and O. Knio. Uncertainty Quantification in MD Simulations. Part II: Bayesian Inference of Force-Field Parameters. Multiscale Modeling & Simulation, 10(4):1460–1492, January 2012.

[42] Stephan Deublein, Patrick Metzler, Jadran Vrabec, and Hans Hasse. Automated development of force fields for the calculation of thermodynamic properties: Acetonitrile as a case study. Molecular Simulation, 39(2):109–118, 2013.

[43] Andreas Köster, Thomas Spura, Gábor Rutkai, Jan Kessler, Hendrik Wiebeler, Jadran Vrabec, and Thomas D. Kühne. Assessing the accuracy of improved force-matched water models derived from Ab initio molecular dynamics simulations. Journal of Computational Chemistry, pages n/a–n/a, May 2016.

[44] Chad W. Hopkins and Adrian E. Roitberg. Fitting of Dihedral Terms in Classical Force Fields as an Analytic Linear Least-Squares Problem. Journal of Chemical Information and Modeling, 54(7):1978–1986, July 2014.

[45] Florent Hédin, Krystel El Hage, and Markus Meuwly. A Toolkit to Fit Nonbonded Parameters from and for Condensed Phase Simulations. Journal of Chemical Information and Modeling, 56(8):1479–1489, August 2016.

[46] Igor I. Baskin, David Winkler, and Igor V. Tetko. A renaissance of neural networks in drug discovery. Expert Opinion on Drug Discovery, 11(8):785–795, August 2016.

[47] Jing Huang, Sarah Rauscher, Grzegorz Nawrocki, Ting Ran, Michael Feig, Bert L. de Groot, Helmut Grubmüller, and Alexander D. MacKerell Jr. CHARMM36m: An improved force field for folded and intrinsically disordered proteins. Nature Methods, 14(1):71–73, January 2017.

[48] L. Le Folgoc, H. Delingette, A. Criminisi, and N. Ayache. Quantifying Registration Uncertainty With Sparse Bayesian Modelling. IEEE Transactions on Medical Imaging, 36(2):607–617, February 2017.

[49] Volker Lesch, Diddo Diddens, Carlos E. S. Bernardes, Benjamin Golub, Alain Dequidt, Veronika Zeindlhofer, Marcello Sega, and Christian Schröder. ForConX: A forcefield conversion tool based on XML. Journal of Computational Chemistry, 38(9):629–638, April 2017.

[50] Justin S. Smith, Olexandr Isayev, and Adrian E. Roitberg. ANI-1: An extensible neural network potential with DFT accuracy at force field computational cost. Chem. Sci., 8(4):3192–3203, 2017.

[51] Peng Xu, Emilie B. Guidez, Colleen Bertoni, and Mark S. Gordon. Perspective: Ab initio force field methods derived from quantum mechanics. The Journal of Chemical Physics, 148(9):090901, March 2018.

[52] John D. Chodera, Michael K Gilson, David L Mobley, Michael R Shirts, Lee-Ping Wang, Kenneth Kroenlein, and Christopher I Bayly. Open Force Field Group. openforcefield.org, 2018.

[53] Pengyu Ren and Jay W. Ponder. Consistent treatment of inter- and intramolecular polarization in molecular mechanics calculations. Journal of Computational Chemistry, 23(16):1497–1506, December 2002.

[54] Alan Grossfield, Pengyu Ren, and Jay W. Ponder. Ion Solvation Thermodynamics from Simulation with a Polarizable Force Field. Journal of the American Chemical Society, 125(50):15671–15682, December 2003.

[55] Sergei Yu. Noskov, Guillaume Lamoureux, and Benoît Roux. Molecular Dynamics Study of Hydration in Ethanol-Water Mixtures Using a Polarizable Force Field. The Journal of Physical Chemistry B, 109(14):6705–6713, April 2005.

[56] Haibo Yu, Daan P. Geerke, Haiyan Liu, and Wilfred F. van Gunsteren. Molecular dynamics simulations of liquid methanol and methanol–water mixtures with polarizable models. Journal of Computational Chemistry, 27(13):1494–1504, October 2006.

[57] Marie L. Laury, Lee-Ping Wang, Vijay S. Pande, Teresa Head-Gordon, and Jay W. Ponder. Revised Parameters for the AMOEBA Polarizable Atomic Multipole Water Model. The Journal of Physical Chemistry B, 119(29):9423–9437, July 2015.

[58] Justin A. Lemkul, Jing Huang, Benoît Roux, and Alexander D. MacKerell. An Empirical Polarizable Force Field Based on the Classical Drude Oscillator Model: Development History and Recent Applications. Chemical Reviews, 116(9):4983–5013, May 2016.

[59] Cui Liu, Yue Li, Bing-Yu Han, Li-Dong Gong, Li-Nan Lu, Zhong-Zhi Yang, and Dong-Xia Zhao. Development of the ABEEMσπ Polarization Force Field for Base Pairs with Amino Acid Residue Complexes. Journal of Chemical Theory and Computation, April 2017.

[60] Yunling Liu, Lan Tao, Jianjun Lu, Shuo Xu, Qin Ma, and Qingling Duan. A novel force field parameter optimization method based on LSSVR for ECEPP. FEBS Letters, 585(6):888–892, March 2011.

[61] Panagiotis Angelikopoulos, Costas Papadimitriou, and Petros Koumoutsakos. Bayesian uncertainty quantification and propagation in molecular dynamics simulations: A high performance computing framework. The Journal of Chemical Physics, 137(14):144103, October 2012.

[62] Olujide O. Olubiyi and Birgit Strodel. Topology and parameter data of thirteen non-natural amino acids for molecular simulations with CHARMM22. Data in Brief, 9:642–647, October 2016.

[63] Lee-Ping Wang, Keri A. McKiernan, Joseph Gomes, Kyle A. Beauchamp, Teresa Head-Gordon, Julia E. Rice, William C. Swope, Todd J. Martínez, and Vijay S. Pande. Building a More Predictive Protein Force Field: A Systematic and Reproducible Route to AMBER-FB15. The Journal of Physical Chemistry B, 121(16):4023–4039, April 2017.

[64] Jack Wildman, Peter Repiščák, Martin J. Paterson, and Ian Galbraith. General Force-Field Parametrization Scheme for Molecular Dynamics Simulations of Conjugated Materials in Solution. Journal of Chemical Theory and Computation, 12(8):3813–3824, August 2016.

[65] Rui Qi, Qiantao Wang, and Pengyu Ren. General van der Waals potential for common organic molecules. Bioorganic & Medicinal Chemistry, 2016.

[66] V. Botu, R. Batra, J. Chapman, and R. Ramprasad. Machine Learning Force Fields: Construction, Validation, and Outlook. The Journal of Physical Chemistry C, 121(1):511–522, January 2017.

[67] Federico Zahariev, Nuwan De Silva, Mark S. Gordon, Theresa L. Windus, and Marilu Dick-Perez. ParFit: A Python-Based Object-Oriented Program for Fitting Molecular Mechanics Parameters to ab Initio Data. Journal of Chemical Information and Modeling, February 2017.

[68] Vijay Pande. Forcebalance: A Systematic, Reproducible, Statistically Driven Approach to More Accurate Molecular Dynamics Models. Biophysical Journal, 106(2):44a, January 2014.

[69] Lee-Ping Wang, Teresa Head-Gordon, Jay W. Ponder, Pengyu Ren, John D. Chodera, Peter K. Eastman, Todd J. Martinez, and Vijay S. Pande. Systematic Improvement of a Classical Molecular Model of Water. The Journal of Physical Chemistry B, 117(34):9956–9972, August 2013.

[70] Hedieh Torabifard, Oleg N. Starovoytov, Pengyu Ren, and G. Andrés Cisneros. Development of an AMOEBA water model using GEM distributed multipoles. Theoretical Chemistry Accounts, 134(8):101, August 2015.

[71] Xiaojiao Mu, Qiantao Wang, Lee-Ping Wang, Stephen D. Fried, Jean-Philip Piquemal, Kevin N. Dalby, and Pengyu Ren. Modeling Organochlorine Compounds and the σ-Hole Effect Using a Polarizable Multipole Force Field. The Journal of Physical Chemistry B, 118(24):6456–6465, June 2014.

[72] Richard T. Bradshaw and Jonathan W. Essex. Evaluating Parametrization Protocols for Hydration Free Energy Calculations with the AMOEBA Polarizable Force Field. Journal of Chemical Theory and Computation, 12(8):3871–3883, August 2016.

[73] Alex Albaugh, Henry A. Boateng, Richard T Bradshaw, Omar N. Demerdash, Jacek Dziedzic, Yuezhi Mao, Daniel Toby Margul, Jason M Swails, Qiao Zeng, David A. Case, Peter Kenneth Eastman, Jonathan W. Essex, Martin Head-Gordon, Vijay S. Pande, Jay William Ponder, Yihan Shao, Chris-Kriton Skylaris, Ilian T. Todorov, Mark Edward Tuckerman, and Teresa Head-Gordon. Advanced Potential Energy Surfaces for Molecular Simulation. The Journal of Physical Chemistry B, August 2016.

[74] Svetlana Artemova, Léonard Jaillet, and Stephane Redon. Automatic molecular structure perception for the universal force field. Journal of Computational Chemistry, 37(13):1191–1205, May 2016.

[75] Guillermo Avendaño-Franco and Aldo H. Romero. Firefly Algorithm for Structural Search. Journal of Chemical Theory and Computation, 12(7):3416–3428, July 2016.

[76] Daniel J. Cole, Jonah Z. Vilseck, Julian Tirado-Rives, Mike C. Payne, and William L. Jorgensen. Biomolecular Force Field Parameterization via Atoms-in-Molecule Electron Density Partitioning. Journal of Chemical Theory and Computation, 12(5):2312–2323, May 2016.

[77] K. Vanommeslaeghe and A. D. MacKerell. Automation of the CHARMM General Force Field (CGenFF) I: Bond Perception and Atom Typing. Journal of Chemical Information and Modeling, 52(12):3144–3154, December 2012.

[78] Qian Zhang, Wei Zhang, Youyong Li, Junmei Wang, Liling Zhang, and Tingjun Hou. A rule-based algorithm for automatic bond type perception. Journal of Cheminformatics, 4:26, October 2012.

[79] Junmei Wang, Wei Wang, Peter A. Kollman, and David A. Case. Automatic atom type and bond type perception in molecular mechanical calculations. Journal of Molecular Graphics and Modelling, 25(2):247–260, October 2006.

[80] Joseph D. Yesselman, Daniel J. Price, Jennifer L. Knight, and Charles L. Brooks. MATCH: An atom-typing toolset for molecular mechanics force fields. Journal of Computational Chemistry, 33(2):189–202, January 2012.

[81] Hans-Christian Ehrlich and Matthias Rarey. Systematic benchmark of substructure search in molecular graphs - From Ullmann to VF2. Journal of Cheminformatics, 4(1):13, December 2012.

[82] David L. Mobley, Caitlin C. Bannan, Andrea Rizzi, Christopher I. Bayly, John D. Chodera, Victoria T. Lim, Nathan M. Lim, Kyle A. Beauchamp, David R. Slochower, Michael R. Shirts, Michael K. Gilson, and Peter K. Eastman. Escaping Atom Types in Force Fields Using Direct Chemical Perception. Journal of Chemical Theory and Computation, 14(11):6076–6092, November 2018.

[83] Junmei Wang, Romain M. Wolf, James W. Caldwell, Peter A. Kollman, and David A. Case. Development and testing of a general amber force field. Journal of Computational Chemistry, 25(9):1157–1174, July 2004.

[84] S. Böcker, Q. B. A. Bui, and A. Truss. Computing bond orders in molecule graphs. Theoretical Computer Science, 412(12):1184–1195, March 2011.

[85] Scott A. Wildman and Gordon M. Crippen. Prediction of Physicochemical Parameters by Atomic Contributions. Journal of Chemical Information and Computer Sciences, 39(5):868–873, September 1999.

[86] A. Pedretti, L. Villa, and G. Vistoli. Atom-type description language: A universal language to recognize atom types implemented in the VEGA program. Theoretical Chemistry Accounts, 109(4):229–232, May 2003.

[87] Kenny B. Lipkowitz and Donald B. Boyd. Reviews in Computational Chemistry. John Wiley & Sons, September 2009.

[88] I. David Brown. What is the best way to determine bond-valence parameters? IUCrJ, 4(Pt 5):514–515, September 2017.

[89] Zhao Jin, Chunwei Yang, Fenglei Cao, Feng Li, Zhifeng Jing, Long Chen, Zhe Shen, Liang Xin, Sijia Tong, and Huai Sun. Hierarchical atom type definitions and extensible all-atom force fields. Journal of Computational Chemistry, 37(7):653–664, 2016.

[90] Bruce L. Bush and Robert P. Sheridan. PATTY: A programmable atom type and language for automatic classification of atoms in molecular databases. Journal of Chemical Information and Computer Sciences, 33(5):756–762, September 1993.

[91] OpenEye Toolkits 2018.Feb. OpenEye Scientific Software, 2018.

[92] Gilles Marcou and Didier Rognan. Optimizing Fragment and Scaffold Docking by Use of Molecular Interaction Fingerprints. Journal of Chemical Information and Modeling, 47(1):195–207, January 2007.

[93] Harald Mauser and Martin Stahl. Chemical Fragment Spaces for de novo Design. Journal of Chemical Information and Modeling, 47(2):318–324, March 2007.

[94] Peter J Green. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 82(4):711–732, 1995.

[95] Kurt Binder and Dieter Heermann. Monte Carlo Simulation in Statistical Physics: An Introduction. Graduate Texts in Physics. Springer-Verlag, Berlin Heidelberg, 5th edition, 2010.

[96] Kathryn Farrell, J. Tinsley Oden, and Danial Faghihi. A Bayesian framework for adaptive selection, calibration, and validation of coarse-grained models of atomistic systems. Journal of Computational Physics, 295:189–208, August 2015.

[97] Andrew L. Ferguson. BayesWHAM: A Bayesian approach for free energy estimation, reweighting, and uncertainty quantification in the weighted histogram analysis method. Journal of Computational Chemistry, 38(18):1583–1605, July 2017.

[98] Nicholas Metropolis, Arianna W. Rosenbluth, Marshall N. Rosenbluth, Augusta H. Teller, and Edward Teller. Equation of State Calculations by Fast Computing Machines. The Journal of Chemical Physics, 21(6):1087–1092, June 1953.

[99] Dirk P. Kroese, Tim Brereton, Thomas Taimre, and Zdravko I. Botev. Why the Monte Carlo method is so important today. Wiley Interdisciplinary Reviews: Computational Statistics, 6(6):386–392, November 2014.

[100] Reinhard Diestel. Graph Theory. Springer, 5th edition, 2016.

[101] Frank Jensen. Introduction to Computational Chemistry. John Wiley & Sons, November 2016.

[102] Zvi Galil. Efficient Algorithms for Finding Maximum Matching in Graphs. ACM Comput. Surv., 18(1):23–38, March 1986.

[103] Fred Glover. Maximum matching in a convex bipartite graph. Naval Research Logistics Quarterly, 14(3):313–316, January 1967.

[104] Christopher I. Bayly, Caitlin C. Bannan, and David L. Mobley. smirnoff99Frosst (DOI 10.5281/zenodo.154555), 2016.

[105] Junmei Wang, Piotr Cieplak, and Peter A. Kollman. How well does a restrained electrostatic potential (RESP) model perform in calculating conformational energies of organic and biological molecules? Journal of Computational Chemistry, 21(12):1049–1074, September 2000. parm99.

[106] CI Bayly, D McKay, and JF Truchon. An Informal AMBER Small Molecule Force Field: Parm@ Frosst. http://www.ccl.net/cca/data/parm_at_Frosst/, 2010.

[107] David S. Wishart, Craig Knox, An Chi Guo, Savita Shrivastava, Murtaza Hassanali, Paul Stothard, Zhan Chang, and Jennifer Woolsey. DrugBank: A comprehensive resource for in silico drug discovery and exploration. Nucleic Acids Research, 34(Database issue):D668–672, January 2006.

[108] David S. Wishart, Craig Knox, An Chi Guo, Dean Cheng, Savita Shrivastava, Dan Tzur, Bijaya Gautam, and Murtaza Hassanali. DrugBank: A knowledgebase for drugs, drug actions and drug targets. Nucleic Acids Research, 36(Database issue):D901–906, January 2008.

[109] Craig Knox, Vivian Law, Timothy Jewison, Philip Liu, Son Ly, Alex Frolkis, Allison Pon, Kelly Banco, Christine Mak, Vanessa Neveu, Yannick Djoumbou, Roman Eisner, An Chi Guo, and David S. Wishart. DrugBank 3.0: A comprehensive resource for ’omics’ research on drugs. Nucleic Acids Research, 39(Database issue):D1035–1041, January 2011.

[110] Vivian Law, Craig Knox, Yannick Djoumbou, Tim Jewison, An Chi Guo, Yifeng Liu, Adam Maciejewski, David Arndt, Michael Wilson, Vanessa Neveu, Alexandra Tang, Geraldine Gabriel, Carol Ly, Sakina Adamjee, Zerihun T. Dame, Beomsoo Han, You Zhou, and David S. Wishart. DrugBank 4.0: Shedding new light on drug metabolism. Nucleic Acids Research, 42(Database issue):D1091–1097, January 2014.

[111] David L. Veenstra, David M. Ferguson, and Peter A. Kollman. How transferable are hydrogen parameters in molecular mechanics calculations? Journal of Computational Chemistry, 13(8):971–978, October 1992.

[112] Joseph A. DiMasi, Henry G. Grabowski, and Ronald W. Hansen. Innovation in the pharmaceutical industry: New estimates of R&D costs. Journal of Health Economics, 47:20–33, May 2016.

[113] Theodora Katsila, Georgios A. Spyroulias, George P. Patrinos, and Minos-Timotheos Matsoukas. Computational approaches in target identification and drug discovery. Computational and Structural Biotechnology Journal, 14:177–184, January 2016.

[114] Lu Zhang, Jianjun Tan, Dan Han, and Hao Zhu. From machine learning to deep learning: Progress in machine intelligence for rational drug discovery. Drug Discovery Today, 22(11):1680–1685, November 2017.

[115] David L Mobley, Christopher I Bayly, Matthew D Cooper, Michael R Shirts, and Ken A Dill. Small molecule hydration free energies in explicit solvent: An extensive test of fixed-charge atomistic simulations. Journal of chemical theory and computation, 5(2):350–358, February 2009.

[116] Pascal Bonnet and Richard A Bryce. Molecular dynamics and free energy analysis of neuraminidase-ligand interactions. Protein science : a publication of the Protein Society, 13(4):946–57, April 2004.

[117] Jianwei Che, Joachim Dzubiella, Bo Li, and J. Andrew McCammon. Electrostatic Free Energy and Its Variations in Implicit Solvent Models. The Journal of Physical Chemistry B, 112(10):3058–3069, March 2008.

[118] Pietro Cozzini, Micaela Fornabaio, Anna Marabotti, Donald J Abraham, Glen E Kellogg, and Andrea Mozzarelli. Free energy of ligand binding to protein: Evaluation of the contribution of water molecules by computational methods. Current medicinal chemistry, 11:3093–3118, 2004.

[119] Alexander Hillisch, Nikolaus Heinrich, and Hanno Wild. Computational Chemistry in the Pharmaceutical Industry: From Childhood to Adolescence. ChemMedChem, 10(12):1958–1962, 2015.

[120] Pavel V Klimovich, Michael R Shirts, and David L Mobley. Guidelines for the analysis of free energy calculations. Journal of computer-aided molecular design, March 2015.

[121] Peter Kollman. Free energy calculations: Applications to chemical and biochemical phenomena. Chemical Reviews, 93(7):2395–2417, November 1993.

[122] X. Hu, I. Maffucci, and A. Contini. Advances in the treatment of explicit water molecules in docking and binding free energy calculations. Current medicinal chemistry, May 2018.

[123] Maria M. Reif and Martin Zacharias. Rapid approximate calculation of water binding free energies in the whole hydration domain of (bio)macromolecules. Journal of Computational Chemistry, 37(18):1711–1724, 2016.

[124] Julien Michel and Jonathan W. Essex. Prediction of protein-ligand binding affinity by free energy simulations: Assumptions, pitfalls and expectations. Journal of Computer-Aided Molecular Design, 24:639–658, 2010.

[125] Devleena Shivakumar, Yuqing Deng, and Benoît Roux. Computations of Absolute Solvation Free Energies of Small Molecules Using Explicit and Implicit Solvent Model. Journal of Chemical Theory and Computation, 5(4):919–930, April 2009.

[126] Linda Yu Zhang, Emilio Gallicchio, Richard A. Friesner, and Ronald M. Levy. Solvent models for protein–ligand binding: Comparison of implicit solvent poisson and surface generalized born models with explicit solvent simulations. Journal of Computational Chemistry, 22(6):591–607, 2001.

[127] Alexey Onufriev. Chapter 7 - Implicit Solvent Models in Molecular Dynamics Simulations: A Brief Overview. In Ralph A. Wheeler and David C. Spellmeyer, editors, Annual Reports in Computational Chemistry, volume 4, pages 125–137. Elsevier, January 2008.

[128] John D Chodera and David L Mobley. Entropy-enthalpy compensation: Role and ramifications in biomolecular ligand recognition and design. Annual review of biophysics, 42:121–42, 2013.

[129] Sergio Decherchi, Matteo Masetti, Ivan Vyalov, and Walter Rocchia. Implicit solvent methods for free energy estimation. European journal of medicinal chemistry, 0:27–42, February 2015.

[130] Thomas Gaillard and David A. Case. Evaluation of DNA Force Fields in Implicit Solvation. Journal of Chemical Theory and Computation, 7(10):3181–3198, October 2011.

[131] Christopher J. Cramer and Donald G. Truhlar. Implicit Solvation Models: Equilibria, Structure, Spectra, and Dynamics. Chemical Reviews, 99(8):2161–2200, August 1999.

[132] Brian N. Dominy and Charles L. Brooks. Development of a Generalized Born Model Parametrization for Proteins and Nucleic Acids. The Journal of Physical Chemistry B, 103(18):3765–3773, May 1999.

[133] Jens Kleinjung and Franca Fraternali. Design and application of implicit solvent models in biomolecular simulations. Current Opinion in Structural Biology, 25:126–134, 2014.

[134] Martin Brieg, Julia Setzler, Steffen Albert, and Wolfgang Wenzel. Generalized Born implicit solvent models for small molecule hydration free energies. Physical Chemistry Chemical Physics, 19(2):1677–1685, 2017.

[135] Berk Hess and Nico F. A. van der Vegt. Hydration thermodynamic properties of amino acid analogues: A systematic comparison of biomolecular force fields and water models. The journal of physical chemistry. B, 110:17616–17626, 2006.

[136] Gerhard Hummer, Lawrence R. Pratt, and Angel E. Garcia. Hydration free energy of water. The Journal of Physical Chemistry, 99(38):14188–14194, September 1995.

[137] B. Jayaram, D. Sprous, and D. L. Beveridge. Solvation Free Energy of Biomacromolecules: Parameters for a Modified Generalized Born Model Consistent with the AMBER Force Field. The Journal of Physical Chemistry B, 102(47):9571–9576, November 1998.

[138] Yuk Yin Sham, Zhen Tao Chu, and Arieh Warshel. Consistent Calculations of pKa’s of Ionizable Residues in Proteins: Semi-microscopic and Microscopic Approaches. The Journal of Physical Chemistry B, 101(22):4458–4472, May 1997.

[139] Jessica M. J. Swanson, John Mongan, and J. Andrew McCammon. Limitations of Atom-Centered Dielectric Functions in Implicit Solvent Models. The Journal of Physical Chemistry B, 109(31):14769–14772, August 2005.

[140] Jason Wagoner and Nathan A. Baker. Solvation forces on biomolecular structures: A comparison of explicit solvent and Poisson–Boltzmann models. Journal of Computational Chemistry, 25(13):1623–1629, 2004.

[141] Jacob Kongsted, Pär Söderhjelm, and Ulf Ryde. How accurate are continuum solvation models for drug-like molecules? Journal of Computer-Aided Molecular Design, 23(7):395–409, July 2009.

[142] Michal Kolář, Jindřich Fanfrlík, Martin Lepšík, Flavio Forti, F. Javier Luque, and Pavel Hobza. Assessing the Accuracy and Performance of Implicit Solvent Models for Drug Molecules: Conformational Ensemble Approaches. The Journal of Physical Chemistry B, 117(19):5950–5962, May 2013.

[143] Michael K. Gilson and Barry Honig. Calculation of the total electrostatic energy of a macromolecular system: Solvation energies, binding energies, and conformational analysis. Proteins: Structure, Function, and Genetics, 4(1):7–18, 1988.

[144] B. Honig and A. Nicholls. Classical electrostatics in biology and chemistry. Science, 268(5214):1144–1149, May 1995.

[145] Alexey Onufriev, David A. Case, and Donald Bashford. Effective Born radii in the generalized Born approximation: The importance of being perfect. Journal of Computational Chemistry, 23(14):1297–1304, November 2002.

[146] Benoît Roux and Thomas Simonson. Implicit solvent models. Biophysical Chemistry, 78:1–20, 1999.

[147] Philippe Ferrara, Joannis Apostolakis, and Amedeo Caflisch. Evaluation of a fast implicit solvent model for molecular dynamics simulations. Proteins: Structure, Function, and Bioinformatics, 46(1):24–33, 2002.

[148] Paul Labute. The generalized Born/volume integral implicit solvent model: Estimation of the free energy of hydration using London dispersion instead of atomic surface area. Journal of Computational Chemistry, 29(10):1693–1698, 2008.

[149] Wonpil Im, Dmitrii Beglov, and Benoît Roux. Continuum solvation model: Computation of electrostatic forces from numerical solutions to the Poisson-Boltzmann equation. Computer Physics Communications, 111(1):59–75, June 1998.

[150] Yufeng Liu, Esmael Haddadian, Tobin R Sosnick, Karl F Freed, and Haipeng Gong. A novel implicit solvent model for simulating the molecular dynamics of RNA. Biophysical journal, 105(5):1248–57, September 2013.

[151] Donald Bashford and David A. Case. Generalized Born Models of Macromolecular Solvation Effects. Annual Review of Physical Chemistry, 51(1):129–152, October 2000.

[152] Alexey Onufriev. The generalized Born model: Its foundation, applications, and limitations. page 52, 2010.

[153] Alexey Onufriev, Donald Bashford, and David A. Case. Exploring Protein Native States and Large-Scale Conformational Changes with a Modified Generalized Born Model. Proteins: Structure, Function and Genetics, 55(May 2003):383–394, 2004.

[154] Nathan A Baker. Improving implicit solvent simulations: A Poisson-centric view. Current Opinion in Structural Biology, 15(2):137–143, April 2005.

[155] Bojan Zagrovic, Eric J Sorin, and Vijay Pande. β-hairpin folding simulations in atomistic detail using an implicit solvent model. Journal of Molecular Biology, 313(1):151–169, October 2001.

[156] Libo Li, Ken A. Dill, and Christopher J. Fennell. Testing the semi-explicit assembly model of aqueous solvation in the SAMPL4 challenge. Journal of Computer-Aided Molecular Design, 28(3):259–264, March 2014.

[157] Shen Li and Philip Bradley. Probing the role of interfacial waters in protein–DNA recognition using a hybrid implicit/explicit solvation model. Proteins: Structure, Function, and Bioinformatics, 81(8):1318–1329, 2013.

[158] Gerhard König, Frank C. Pickard, Ye Mei, and Bernard R. Brooks. Predicting hydration free energies with a hybrid QM/MM approach: An evaluation of implicit and explicit solvation models in SAMPL4. Journal of Computer-Aided Molecular Design, 28(3):245–257, March 2014.

[159] Hiraku Oshima and Masahiro Kinoshita. A highly efficient hybrid method for calculating the hydration free energy of a protein. Journal of computational chemistry, November 2015.

[160] Michael S. Lee, Freddie R. Salsbury, and Mark A. Olson. An efficient hybrid explicit/implicit solvent method for biomolecular simulations. Journal of Computational Chemistry, 25(16):1967–1978, December 2004.

[161] Libo Li, Christopher J Fennell, and Ken A Dill. Field-SEA: A model for computing the solvation free energies of nonpolar, polar, and charged solutes in water. The journal of physical chemistry. B, 118(24):6431–7, June 2014.

[162] Christopher J. Fennell, Charles W. Kehoe, and Ken A. Dill. Modeling aqueous solvation with semi-explicit assembly. Proceedings of the National Academy of Sciences, 108(8):3234–3239, February 2011.

[163] Some collected tools for molecular simulation pipelines: Choderalab/mmtools. Chodera lab // Memorial Sloan Kettering Cancer Center, January 2019.

[164] Mark James Abraham, Teemu Murtola, Roland Schulz, Szilárd Páll, Jeremy C. Smith, Berk Hess, and Erik Lindahl. GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. SoftwareX, 1:19–25, 2015.

[165] Gregory D. Hawkins, Christopher J. Cramer, and Donald G. Truhlar. Parametrized Models of Aqueous Free Energies of Solvation Based on Pairwise Descreening of Solute Atomic Charges from a Dielectric Medium. The Journal of Physical Chemistry, 100(51):19824–19839, January 1996.

[166] Charles H Bennett. Efficient estimation of free energy differences from Monte Carlo data. Journal of Computational Physics, 22(2):245–268, October 1976.

[167] David L Mobley, Janene R Baker, Alan E Barber, Christopher J Fennell, and Ken A Dill. Charge asymmetries in hydration of polar solutes. The journal of physical chemistry. B, 112(8):2405–14, February 2008.

[168] David L Mobley, Ken A Dill, and John D Chodera. Treating entropy and conformational changes in implicit solvent simulations of small molecules. The journal of physical chemistry. B, 112(3):938–46, January 2008.

[169] AMBER 2015, 2015.

[170] Romelia Salomon-Ferrer, David A. Case, and Ross C. Walker. An overview of the Amber biomolecular simulation package. Wiley Interdisciplinary Reviews: Computational Molecular Science, 3(2):198–210, March 2013.

[171] Bill R. Miller, T. Dwight McGee, Jason M. Swails, Nadine Homeyer, Holger Gohlke, and Adrian E. Roitberg. MMPBSA.py : An Efficient Program for End-State Free Energy Calculations. Journal of Chemical Theory and Computation, 8(9):3314–3321, September 2012.

[172] Peter Eastman and Vijay S. Pande. OpenMM: A Hardware Independent Framework for Molecular Simulations. Computing in science & engineering, 12(4):34–39, July 2015.

[173] Gregory D. Hawkins, Christopher J. Cramer, and Donald G. Truhlar. Pairwise solute descreening of solute charges from a dielectric medium. Chemical Physics Letters, 246(1):122–129, November 1995.

[174] Boris Aguilar, Richard Shadrach, and Alexey V. Onufriev. Reducing the Secondary Structure Bias in the Generalized Born Model via R6 Effective Radii. Journal of Chemical Theory and Computation, 6(12):3613–3630, December 2010.

[175] Saeed Izadi, Boris Aguilar, and Alexey V. Onufriev. Protein–Ligand Electrostatic Binding Free Energies from Explicit and Implicit Solvation. Journal of Chemical Theory and Computation, 11(9):4450–4459, September 2015.

[176] Saeed Izadi, Robert C. Harris, Marcia O. Fenley, and Alexey V. Onufriev. Accuracy Comparison of Generalized Born Models in the Calculation of Electrostatic Binding Free Energies. Journal of Chemical Theory and Computation, 14(3):1656–1670, March 2018.

[177] Zef A. Könst, Anne R. Szklarski, Simone Pellegrino, Sharon E. Michalak, Mélanie Meyer, Camila Zanette, Regina Cencic, Sangkil Nam, Vamsee K. Voora, David A. Horne, Jerry Pelletier, David L. Mobley, Gulnara Yusupova, Marat Yusupov, and Christopher D. Vanderwal. Synthesis facilitates an understanding of the structural basis for translation inhibition by the lissoclimides. Nature Chemistry, 9(11):1140–1149, November 2017.

[178] Simone Pellegrino, Mélanie Meyer, Zef A. Könst, Mikael Holm, Vamsee K. Voora, Daniya Kashinskaya, Camila Zanette, David L. Mobley, Gulnara Yusupova, Chris D. Vanderwal, Scott C. Blanchard, and Marat Yusupov. Understanding the role of intermolecular interactions between lissoclimides and the eukaryotic ribosome. Nucleic Acids Research, 2019.

[179] Bruce Alberts. The Cell as a Collection of Protein Machines: Preparing the Next Generation of Molecular Biologists. Cell, 92(3):291–294, 1998.

[180] Alexander S Spirin. Ribosome as a molecular machine. FEBS Letters, 514(1):2–10, March 2002.

[181] B Weisblum and J Davies. Antibiotic inhibitors of the bacterial ribosome. Bacteriological reviews, 32(4 Pt 2):493–528, December 1968.

[182] Jacob Poehlsgaard and Stephen Douthwaite. The bacterial ribosome as a target for antibiotics. Nature reviews. Microbiology, 3(11):870–81, November 2005.

[183] S Kawai, S Murao, M Mochizuki, I Shibuya, K Yano, and M Takagi. Drastic alteration of cycloheximide sensitivity by substitution of one amino acid in the L41 ribosomal protein of yeasts. Journal of bacteriology, 174(1):254–62, January 1992.

[184] T G Obrig, W J Culp, W L McKeehan, and B Hardesty. The mechanism by which cycloheximide and related glutarimide antibiotics inhibit peptide synthesis on reticulocyte ribosomes. The Journal of biological chemistry, 246(1):174–81, January 1971.

[185] L. A. Kominek. Cycloheximide Production by Streptomyces griseus: Control Mechanisms of Cycloheximide Biosynthesis. Antimicrobial Agents and Chemotherapy, 7(6):856–860, June 1975.

[186] Koko Sugawara, Yuji Nishiyama, Soichiro Toda, Nobujiro Komiyama, Masami Hatori, Toshio Moriyama, Yosuke Sawada, Hideo Kamei, Masataka Konishi, and Toshikazu Oki. Lactimidomycin, a new glutarimide group antibiotic. Production, isolation, structure and biological activity. The Journal of Antibiotics, 45(9):1433–1441, 1992.

[187] Tilman Schneider-Poetsch, Jianhua Ju, Daniel E Eyler, Yongjun Dang, Shridhar Bhat, William C Merrick, Rachel Green, Ben Shen, and Jun O Liu. Inhibition of eukaryotic translation elongation by cycloheximide and lactimidomycin. Nature Chemical Biology, 6(3):209–217, 2010.

[188] Dennis M Krüger, Gisela Jessen, and Holger Gohlke. How good are state-of-the-art docking tools in predicting ligand binding modes in protein-protein interfaces? Journal of chemical information and modeling, 52(11):2807–11, November 2012.

[189] Ajay N Jain and Anthony Nicholls. Recommendations for evaluation of computational methods. Journal of computer-aided molecular design, 22(3-4):133–9, January.

[190] Paul S. Charifson, Joseph J. Corkery, Mark A. Murcko, and W. Patrick Walters. Consensus Scoring: A Method for Obtaining Improved Hit Rates from Docking Databases of Three-Dimensional Structures into Proteins. Journal of Medicinal Chemistry, 42(25):5100–5109, December 1999.

[191] Douglas R Houston and Malcolm D Walkinshaw. Consensus docking: Improving the reliability of docking in a virtual screening context. Journal of chemical information and modeling, 53(2):384–90, February 2013.

[192] Oleg Trott and Arthur J Olson. AutoDock Vina: Improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. Journal of computational chemistry, 31(2):455–61, January 2010.

[193] OpenEye Scientific Software. ROCS, 2014.

[194] Schrödinger, LLC. Maestro, 2014.

[195] Mark McGann. FRED pose prediction and virtual screening accuracy. Journal of chemical information and modeling, 51(3):578–96, March 2011.

[196] Mark McGann. FRED and HYBRID docking performance on standardized datasets. Journal of computer-aided molecular design, 26(8):897–906, August 2012.

[197] Thomas S Rush, J Andrew Grant, Lidia Mosyak, and Anthony Nicholls. A shape-based 3-D scaffold hopping method and its application to a bacterial protein-protein interaction. Journal of medicinal chemistry, 48(5):1489–95, March 2005.

[198] Andreas Bender and Robert C. Glen. Molecular similarity: A key technique in molecular informatics. Organic & Biomolecular Chemistry, 2(22):3204, November 2004.

[199] Fabien Fontaine, Evan Bolton, Yulia Borodina, and Stephen H Bryant. Fast 3D shape screening of large chemical databases through alignment-recycling. Chemistry Central journal, 1:12, January 2007.

[200] Tiziano Tuccinardi, Giulio Poli, Veronica Romboli, Antonio Giordano, and Adriano Martinelli. Extensive consensus docking evaluation for ligand pose prediction and virtual screening studies. Journal of chemical information and modeling, 54(10):2980–6, October 2014.

[201] Arnout R D Voet, Ashutosh Kumar, Francois Berenger, and Kam Y J Zhang. Combining in silico and in cerebro approaches for virtual screening and pose prediction in SAMPL4. Journal of computer-aided molecular design, 28(4):363–73, April 2014.

[202] Neil Kumar, Bart S. Hendriks, Kevin A. Janes, David de Graaf, and Douglas A. Lauffenburger. Applying computational modeling to drug discovery and development. Drug Discovery Today, 11(17):806–811, September 2006.

[203] Ryan G. Coleman, Teague Sterling, and Dahlia R. Weiss. SAMPL4 & DOCK3.7: Lessons for automated docking procedures. Journal of computer-aided molecular design, 28(3):201–209, March 2014.

[204] A.-P. Magiorakos, A. Srinivasan, R.B. Carey, Y. Carmeli, M.E. Falagas, C.G. Giske, S. Harbarth, J.F. Hindler, G. Kahlmeter, B. Olsson-Liljequist, D.L. Paterson, L.B. Rice, J. Stelling, M.J. Struelens, A. Vatopoulos, J.T. Weber, and D.L. Monnet. Multidrug-resistant, extensively drug-resistant and pandrug-resistant bacteria: An international expert proposal for interim standard definitions for acquired resistance. Clinical Microbiology and Infection, 18(3):268–281, March 2012.

[205] Stephanie K Aoki, Rupinderjit Pamma, Aaron D Hernday, Jessica E Bickham, Bruce A Braaten, and David A Low. Contact-dependent inhibition of growth in Escherichia coli. Science (New York, N.Y.), 309(5738):1245–8, August 2005.

[206] Stephanie K Aoki, Elie J Diner, Claire T’kint de Roodenbeke, Brandt R Burgess, Stephen J Poole, Bruce A Braaten, Allison M Jones, Julia S Webb, Christopher S Hayes, Peggy A Cotter, and David A Low. A widespread family of polymorphic contact-dependent toxin delivery systems in bacteria. Nature, 468(7322):439–42, November 2010.

[207] Stephen J Poole, Elie J Diner, Stephanie K Aoki, Bruce A Braaten, Claire t’Kint de Roodenbeke, David A Low, and Christopher S Hayes. Identification of functional toxin/immunity genes linked to contact-dependent growth inhibition (CDI) and rearrangement hotspot (Rhs) systems. PLoS genetics, 7(8):e1002217, August 2011.

[208] Robert P Morse, Kiel C Nikolakakis, Julia L E Willett, Elias Gerrick, David A Low, Christopher S Hayes, and Celia W Goulding. Structural basis of toxicity and immunity in contact-dependent growth inhibition (CDI) systems. Proceedings of the National Academy of Sciences of the United States of America, 109(52):21480–5, December 2012.

[209] Zachary C Ruhe, David A Low, and Christopher S Hayes. Bacterial contact-dependent growth inhibition. Trends in microbiology, 21(5):230–7, May 2013.

[210] Schrödinger, LLC. The PyMOL Molecular Graphics System, Version 1.7. Schrödinger, LLC, 2015.

[211] Jay W. Ponder and David A. Case. Force fields for protein simulations. Advances in Protein Chemistry, 66:27–85, 2003.

[212] Alan W. Sousa da Silva and Wim F. Vranken. ACPYPE - AnteChamber PYthon Parser interfacE. BMC Research Notes, 5(1):367, July 2012.

Appendix A

Appendix - Progress toward the comprehensive computational design of a macrocyclic peptide

Algorithm 1 Macrocyclic peptide auto run

#!/bin/sh
pdb_file=mut4c_.pdb
prefix=mut4c
# Patch the LEaP script for this PDB file, then run tleap
python fix_leapscript.py -i peptide2.leap.scrpt -p $pdb_file
tleap -f peptide.leap.scrpt
# Convert the AMBER topology and coordinates to GROMACS format with ACPYPE
acpype.py -p "$prefix".prmtop -x "$prefix".inpcrd -r
sufix="_GMX"
name_file=$prefix$sufix
echo $name_file
# Define a dodecahedral box (0.5 nm solute-to-box distance) and solvate it
editconf -f "$name_file".gro -bt dodecahedron -d 0.5 -o box.gro
genbox -cp box.gro -cs spc216.gro -p "$name_file".top -o "$name_file".gro
# Patch the topology and run script, then submit the job
python fix_topfile.py -i "$name_file".top -o "$name_file".top
python fix_runfile.py -i prun3.sh -t "$name_file".top
sbatch prun.sh

Algorithm 2 Macrocyclic peptide analysis

PREFIX="mol2"
trjconv -s equil_nvt.tpr -f prod.trr -o prod_nowater.trr -pbc nojump <<EOF
1
EOF
tpbconv -s equil_nvt.tpr -o equil_nowater.tpr <<EOF
1
EOF
trjconv -s equil_nowater.tpr -f prod_nowater.trr -o prod_nowater_fit.trr -fit rottrans <<EOF
4
1
EOF
g_rms -s equil_nowater.tpr -f prod_nowater_fit.trr -o rmsd_"$PREFIX"_bb.xvg <<EOF
4
4
EOF
g_rmsf -s equil_nowater.tpr -f prod_nowater_fit.trr -o rmsf_"$PREFIX"_res.xvg -res <<EOF
1
EOF
## do_dssp -f prod_nowater_fit.trr -s equil_nowater.tpr -sc dssp_"$PREFIX".xvg -o ss.xpm -dt 10
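The analysis script above writes its RMSD and RMSF results to GROMACS .xvg files, which are plain text with `#` comment lines and `@` plotting directives ahead of the numeric columns. The following is a minimal sketch of how such a file could be read and summarized in Python. It is not part of the original workflow: the function name `read_xvg` and the sample values are hypothetical and serve only as illustration, and time in ps and RMSD in nm are assumed as the units.

```python
from statistics import mean


def read_xvg(lines):
    """Parse XVG text lines into a list of (x, y) float pairs.

    Lines starting with '#' (comments) or '@' (plot directives)
    are skipped; only the first two columns are kept.
    """
    data = []
    for line in lines:
        line = line.strip()
        if not line or line[0] in "#@":
            continue
        fields = line.split()
        data.append((float(fields[0]), float(fields[1])))
    return data


# Hypothetical excerpt standing in for a file such as rmsd_mol2_bb.xvg;
# the numbers are made up for illustration only.
sample = """\
# produced by g_rms
@    title "RMSD"
@    xaxis  label "Time (ps)"
@    yaxis  label "RMSD (nm)"
0.0  0.000
10.0 0.102
20.0 0.118
""".splitlines()

series = read_xvg(sample)
mean_rmsd = mean(y for _, y in series)
print(f"{len(series)} frames, mean RMSD = {mean_rmsd:.3f} nm")
```

To apply this to a real output file, the lines can be read with `open(path).read().splitlines()` and passed to the same function.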
