University of South Florida Scholar Commons

Graduate Theses and Dissertations Graduate School

4-15-2015 Development and application of web-based open source drug discovery platforms Yuri Pevzner University of South Florida, [email protected]

Follow this and additional works at: https://scholarcommons.usf.edu/etd Part of the Chemistry Commons

Scholar Commons Citation Pevzner, Yuri, "Development and application of web-based open source drug discovery platforms" (2015). Graduate Theses and Dissertations. https://scholarcommons.usf.edu/etd/5550

This Dissertation is brought to you for free and open access by the Graduate School at Scholar Commons. It has been accepted for inclusion in Graduate Theses and Dissertations by an authorized administrator of Scholar Commons. For more information, please contact [email protected].

Development and Application of Web-Based Open Source Drug Discovery Platforms

by

Yuri Pevzner

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Chemistry with a concentration in Computational Chemistry Department of Chemistry College of Arts and Sciences University of South Florida

Co-Major Professor: Henry Lee Woodcock, Ph.D. Co-Major Professor: Wayne C. Guida, Ph.D. Brian Space, Ph.D. Kenyon G. Daniel, Ph.D.

Date of Approval: April 15, 2015

Keywords: CADD, Docking, Virtual Target Screening, CHARMMing

Copyright © 2015, Yuri Pevzner

DEDICATION

I dedicate this work to my mother. Although many people have been instrumental in helping me in this achievement, she alone has seen me through the entire journey that has culminated in this work. From filling my childhood with love and happiness, to giving much needed guidance and encouragement to the developing individual, to having complete faith and confidence in every step I’ve taken since, she has been there. Knowingly or otherwise, she has instilled in me every trait that I value about myself, every trait that defines who I am.

Mom, I want to thank you for being there through time and distance, thank you for making home the warm and welcoming place it has always been. Thank you for being my dearest friend. Happy birthday mom, this one is for you!

ACKNOWLEDGMENTS

I would like to thank my major advisors Dr. H. Lee Woodcock and Dr. Wayne C. Guida.

Dr. Woodcock’s energy and infectious excitement about science was what had helped me stay productive and enthusiastic about my work throughout the long and arduous years of studies. Dr.

Guida’s wisdom and valuable guidance will serve me well for many years, as in my eyes, he has always been an epitome of a scientist and a teacher.

I would also like to thank Dr. Brian Space and Dr. Kenyon G. Daniel. Dr. Space for taking his time and patience to teach me the most important concepts of the physics of chemistry and giving me the tools and the confidence needed to navigate the waters of science. Dr. Daniel for instilling into me the appreciation for biology and, most importantly, for having a much appreciated faith in my scientific abilities and decisions.

I would also like to thank the University of South Florida department of chemistry. The faculty and students that have helped create fun and stimulating environment. The administrative staff, in particular Ms. Adrienne McCain, for helping me navigate the administrative requirements, staying on track with the program and, finally, graduating.

TABLE OF CONTENTS

List of Tables ...... iii

List of Figures ...... iv

Abstract ...... vi

Chapter One: Introduction ...... 1 The Need for Web-Based Open Source Drug Discovery Platforms...... 1 Challenges of Open Source Projects ...... 4 List of References ...... 7

Chapter Two: Development of the CHARMMing Web User Interface as a Platform for Computer-Aided Drug Design ...... 9 Note to Reader ...... 9 CHARMM Interface and Graphics Platform ...... 9 Fragment-Based Docking Protocol in CHARMMing ...... 11 Implementation Details ...... 11 Target Preparation ...... 11 Compound Library and Ligand Upload ...... 11 Binding Site Definition ...... 12 Docking Protocol ...... 15 Job Submission and Monitoring ...... 19 Performance and Local Execution ...... 19 Results and Discussion ...... 22 Conclusion ...... 26 Acknowledgments...... 27 List of References ...... 28

Chapter Three: Development and Application of the Virtual Target Screening Protocol ...... 33 Development of the Virtual Target Screening Server ...... 33 Methodology Background ...... 33 VTS Protocol ...... 34 VTS Web-Based Interface ...... 35 Application of VTS to Identify Protein Targets of Natural Products in Drug Discovery .38 Natural Products and VTS ...... 38 Materials and Methods ...... 39 Virtual Compound Libraries ...... 39 Ligand Preparation ...... 40 i

Virtual Target Screening ...... 41 Hardware ...... 41 Graphics and Modeling ...... 42 Results ...... 42 VTS on Natural Products ...... 42 VTS on CDDI Natural Products ...... 45 Discussion ...... 48 Conclusion ...... 53 Using Molecular Modeling, Chemo- and Bioinformatics to Search for Biomolecular Targets of Vitamin E δ-tocotrienol ...... 53 Introduction ...... 53 Methods...... 57 VEDT Preparation ...... 57 Virtual Target Screening ...... 57 PharmMapper ...... 57 PASS (Prediction of Activity Spectra of Substances) ...... 58 ProBiS (Protein Binding Site Detection) ...... 59 Results ...... 59 Discussion ...... 64 Consensus between Computational Studies...... 64 Consistency with Experimental Data ...... 65 List of References ...... 66

Chapter Four: CHARMMing Drug Design and Additional Modules Development .....74 CHARMMing Drug Design Database ...... 74 Database Design Considerations...... 74 Dynamic Relationship Schema ...... 76 CHARMMing Fragment-Based de novo Drug Design Module ...... 79 Introduction ...... 79 Fragment-Based de novo Design Protocol Implementation in CHARMMing ...... 81 CHARMMing QSAR/SAR Module ...... 84 Introduction ...... 84 (Q)SAR Functionality of CHARMMing ...... 87 Acknowledgments...... 90 Computational Chemistry Education with CHARMMing ...... 90 Introduction ...... 90 Design and Implementation of CHARMMing Lessons Module ...... 91 Conclusion ...... 95 List of References ...... 96

Appendix A Permission to Print CHARMMing Fragment-Based Docking Paper ...... 101 Permission to Print VTS Paper ...... 102 Permission to Print CHARMMing (Q)SAR Paper ...... 103

ii

LIST OF TABLES

Table 1. RMSDs of the docking poses generated by CHARMMing’s fragment-based docking protocol and Glide SP and the success rates ...... 25

Table 2. Top NCI natural products screen hits ...... 43

Table 3. Top CDDI set hits...... 46

Table 4. VEDT and δ-tocopherol VTS results...... 59

Table 5. Results of the ProBis query based on the PDB ID 1NDE input structure...... 62

Table 6. PASS prediction results ...... 64

Table 7. List of predictive models available through CHARMMing ...... 89

iii

LIST OF FIGURES

Figure 1. The “Ligand Set Details” page allows the user to manage custom ligand sets...... 13

Figure 2. The “Submit Docking Job” page ...... 14

Figure 3. The binding site definition page ...... 15

Figure 4. Schematic depiction of fragment-based docking protocol implemented into CHARMMing...... 17

Figure 5. The “Job Details” page ...... 20

Figure 6. Parallelization of the docking protocol ...... 22

Figure 7. Example use of the JChemPaint applet embedded into the gVTS...... 37

Figure 8. Illustration of VTS screen results page ...... 38

Figure 9. Daunorubicin docked into Lck. Daunorubicin’s docked pose according to the VTS docking procedure ...... 45

Figure 10. 368 docked into HCK ...... 49

Figure 11. Comparison of the structures of δ-tocopherol and δ-tocotrienol...... 56

Figure 12. Docked pose of VEDT as identified by the VTS ...... 62

Figure 13. The triad of tables dictated by traditional database design approach to define an object ...... 77

Figure 14. The layout of the dynamic relationship schema ...... 78

Figure 15. Flow chart of the CHARMMing’s fragment-based de novo design procedure...... 83

Figure 16. CHARMMing’s fragment-based de novo design job submission page ...... 85

Figure 17. CHARMMing’s fragment-based de novo design job details/results page ...... 86

Figure 18. CHARMMing (Q)SAR model training file upload interface ...... 88

iv

Figure 19. CHARMMing (Q)SAR model details screen ...... 89

Figure 20. Examples of lesson progress bar ...... 94

v

ABSTRACT

Computational modeling approaches have lately been earning their place as viable tools in drug discovery. Research efforts more often include computational component and the usage of the scientific is commonplace at more stages of the drug discovery pipeline.

However, as software takes on more responsibility and the computational methods grow more involved, the gap grows between research entities that have the means to maintain the necessary computational infrastructure and those that lack the technical expertise or financial means to obtain and include computational component in their scientific efforts. To fill this gap and to meet the need of many, mainly academic, labs numerous community contributions collectively known as open source projects play an increasingly important role. This work describes design, implementation and application of a set of drug discovery workflows based on the CHARMMing

(CHARMM interface and graphics) web-server. The protocols described herein include docking, virtual target screening, de novo drug design, SAR/QSAR modeling as well as chemical education. The performance of the newly developed workflows is evaluated by applying them to a number of scientific problems that include reproducibility of crystal poses of small molecules in protein-ligand systems, identification of potential targets of a library of natural compounds as well as elucidating molecular targets of a vitamin. The results of these inquiries show that protocols developed as part of this effort perform comparably to commercial products, are able to produce results consistent with the experimental data and can substantially enrich the research efforts of labs with otherwise little or no computational component

vi

CHAPTER ONE:

INTRODUCTION

The Need for Web-Based Open Source Drug Discovery Platforms

The concept of open source application development has emerged with the popularization of the World Wide Web as a platform for technical and scientific collaboration. The ability to communicate, share ideas and individual contributions to achieve a single objective gave rise to group efforts where each participant could advance the common goal independent of their physical location, affiliation or even technical background. Such a concept promotes the participation of people that while having a wide range of skills have a common set of interests or even passions. This allows the participants to contribute “freely”, often without an expectation of monetary reimbursement purely for the sake of advancing the field and serving the community of people with similar interests. The motivation behind such undertakings is especially strong when the project contributes to satisfying a need of significant importance, a need that would be difficult to meet without such a collective effort.

Drug discovery is a field where, at first glance, such an open community effort is an unlikely approach. Historically the cost and effort of developing a new drug has largely confined successes to large pharmaceutical companies or otherwise well-funded research institutions.1

Development and use of computer aided drug design (CADD) techniques has provided numerous benefits to the overall process and has been acknowledged by the award of the Noble Prize in

Chemistry in 2013.2 However the expertise required to create powerful commercial software packages has resulted in high licensing costs,3,4 thus limiting access to academic groups.

1

Fortunately, this trend has started to shift with the emergence of freely available software such, as Autodock5 and several other packages,4 largely developed by the academic computational chemistry community. However, for the most part, these software packages require familiarity with CADD methodologies and are better suited for computer savvy users that are at least comfortable if not familiar with the computational component of drug discovery.6 This has hampered the proliferation of CADD tools into less computationally minded drug discovery labs.

The need for intuitive and easy to use CADD solutions has largely been met by the commercial software companies such as Accelrys, Schrödinger, and others that have incorporated full featured graphical user interfaces (GUI) into their programs.7–9 However, as alluded to above, the cost of these packages is typically prohibitive to academic groups and/or institutions. Further, it has proven increasingly difficult to strike a balance between software that is user-friendly yet incorporates a wide range of advanced functionality and customizability. Another aspect of concern is portability. For example stand-alone software that requires local installation on every computer, may find less use in today’s world1 where researchers expect both the application and the data to be accessible from any machine on any platform from any location.10

Another hurdle, faced by the non-expert, to incorporating computational modeling into drug discovery efforts is the difficulty of obtaining reliable small molecule parameters.11–13 Most widely used and well tested force fields have been developed with proteins and nucleic acids rather than small molecules in mind14. Until recently this has meant that drug-like molecule parameters have been less reliable, with assignment often arbitrary. Lately, however, there has been a significant amount of effort devoted to improving the reliability of small molecule parameters and developing efficient protocols to generate them for a much greater and more diverse chemical space11,12,14,15.

2

The applications development of which is described in this work have been designed with several criteria in mind:

1. Incorporation of a range of drug discovery methods to aid researchers at different

points in the drug discovery process. Such methods include molecular modeling and

simulations, docking, de-novo drug design and chemoinformatics.

2. All the data produced by the applications or uploaded by the user remains accessible

indefinitely and is tied to the user account and secured by the verification of login

credentials. All the jobs run independently and do not require active user session. The

results of user jobs along with all the input and output file are stored in the database

and are available for access at any time in the future.

3. Pre-loaded public compound library is available for docking and virtual screening

jobs in addition to user supplied compounds.

4. Integration with outside resources such as protein, compound and assay data

repositories.

5. Usable by computational novices and advanced users alike.

Primary users targeted by the application include small academic labs and any other labs that don’t have financial means of obtaining commercial drug discovery packages or don’t have the expertise to run and maintain drug discovery applications without having access to computational core facilities. Some more specific examples of potential users are:

 Medicinal chemistry lab that would like to use computational techniques at early stages

of discovery to get lead ideas

 An experimental lab that would like to supplement their findings with additional rationale

based on computational methods

3

 A group that would like to build hypothesis for lead optimization using a set of already

obtained experimental data

Challenges of Open Source Projects

Open source approach to developing drug discovery applications has a number of challenges. Some are inherent to the concept of open source approach itself while others are more specific to the field of computer - aided drug discovery. The very nature of collaborative open source projects means that there are multiple contributors. Aside from the more obvious factors such as different geographical locations, which internet more or less successfully alleviates, there are other issues to consider. It is likely that there may be significant differences between technical backgrounds of the contributors. Since in this case the project largely deals with computer programming, the most evident manifestation of these differences may be the variety of programming languages and development paradigms the contributors may be comfortable with. This presents a two-fold challenge. First, at the earliest stages of the project some common coding framework should be chosen such that new contributors find it easy enough to transition to it even if they come from a different background or in general have less advanced technical skills. Second, as the project progresses the coding standards should be adhered to and kept within the common framework. The most obvious benefits of these steps will be the fact that a new developer will be able to get up to speed quickly, will be able to read and understand the code of other developers, and most importantly will be able to follow the same design and coding patterns. This will help prevent the project from becoming too fragmented over time, allow participants to better integrate their individual contributions into the

4

overall project as well as make maintenance and upgrades easier. The latter is especially important for open-source community efforts as the contributor turnover is usually high.

Another largely technical aspect of any open-source effort is the platforms. In this context platform encompasses such concepts as computer architecture and . Both the platform on which the development is done and on which the final product is served to the end- user are important. Even though most programming languages or development frameworks strive to be platform-independent, in practice platform differences may pose challenges ranging from syntax differences to binaries or compiler incompatibilities. Similarly, packaging a final product as a Windows® executable will likely mean that it will not execute on a Linux based machine.

The end-user platform consideration is particularly sensitive in the age of mobile computing where users expect their applications served on wide variety of mobile devices.

The issue of data and information standards is also an important consideration when it comes to open-source efforts and it is by no means unique to the field of CADD. However in the context of this work it is proper to consider cases specific to drug discovery, storage and representation of chemical information. A wide variety of structural data formats makes for an important consideration when designing an open-source application whose functionality largely hinges on the ability to process chemical structural information. In principle, suitable plfug-ins can over time be developed by community contributors to allow the application to speak any dialect of a chemical language. However a project whose success depends on the community’s acceptance risks alienating its users if a basic support for most common formats is not provided from the beginning. Another aspect that, to a degree, deals with standards is compatibility with other commercial, free or open-source tools available to the community. CADD is a large and growing field. Computer techniques find use in many stages of the drug discovery process

5

ranging from hit finding to lead optimization, toxicology predictions and beyond. In addition there may be multiple computational approaches or software applications to perform similar tasks. With this in mind the open-source application has to be able to both accept as input and produce the output compatible with other programs that the user may potentially utilize as part of their research.

A more practical aspect of developing a scientific application is its ability to crunch a lot of data fast. In other words, performance is very important consideration. From running an extended molecular dynamics simulation on a many atom system to virtually screening a large compound library, the application has to be able to perform such computationally intensive tasks in a reasonable time. In addition the execution pattern has to be able to take advantage of common multi-processor or high performance computing (HPC) architectures in order to spread the computational load among many worker threads.

Another important point to consider when developing an open source application is its ability to be expanded, deployed on outside resources or adopted for a particular need. The level of difficulty associated with these tasks would directly correlate with the application’s pool of willing contributors and its ability to serve as the basis for independent projects.

Finally, the most important aspect of developing any scientific application is the soundness of the methods that it implements and an ability to reasonably validate the results that it produces. The CADD community is a growing one and there is no shortage of contributors of methods in many of its areas. This is a coin of two sides however; as there is a choice of quality methods developed and peer reviewed on one side and the challenge of integrating this diversity of methods and tasks into a cohesive workflow. The latter represents a particular challenge and

6

in addition to addressing the above challenges is one of the primary foci of the work presented in this manuscript.

List of References

(1) Moors, E. H. M.; Cohen, A. F.; Schellekens, H. Towards a sustainable system of drug development. Drug Discov. Today 2014, In Press.

(2) The Nobel Prize in Chemistry 2013. The Nobel Prize in Chemistry 2013 http://www.nobelprize.org/nobel_prizes/chemistry/laureates/2013/ (accessed Apr 4, 2015).

(3) DeLano, W. L. The case for open-source software in drug discovery. Drug Discov. Today 2005, 10, 213–217.

(4) Geldenhuys, W. J.; Gaasch, K. E.; Watson, M.; Allen, D. D.; Van der Schyf, C. J. Optimizing the use of open-source software applications in drug discovery. Drug Discov. Today 2006, 11, 127–132.

(5) Goodsell, D. S.; Olson, A. J. Automated docking of substrates to proteins by simulated annealing. Proteins 1990, 8, 195–202.

(6) Lill, M. A.; Danielson, M. L. Computer-aided drug design platform using PyMOL. J. Comput. Aided. Mol. Des. 2011, 25, 13–19.

(7) Schrödinger, L. Maestro. Maestro, 2014.

(8) Accelrys Software Inc. Discovery Studio Modeling Environment. Discovery Studio Modeling Environment, 2013.

(9) Chemical Computing Group Inc. Molecular Operating Environment (MOE), 2013.08. Molecular Operating Environment (MOE), 2013.08, 2013.

(10) Ebejer, J.-P.; Fulle, S.; Morris, G. M.; Finn, P. W. The emerging role of cloud computing in molecular modelling. J. Mol. Graph. Model. 2013, 44, 177–187.

(11) Halgren, T. A. Merck molecular force field. I. Basis, form, scope, parameterization, and performance of MMFF94. J. Comput. Chem. 1996, 17, 490–519.

(12) Vanommeslaeghe, K.; Hatcher, E.; Acharya, C.; Kundu, S.; Zhong, S.; Shim, J.; Darian, E.; Guvench, O.; Lopes, P.; Vorobyov, I.; Mackerell, A. D. CHARMM general force field: A force field for drug-like molecules compatible with the CHARMM all-atom additive biological force fields. J. Comput. Chem. 2010, 31, 671–690. 7

(13) Knight, J. L.; Brooks, C. L. Validating CHARMM parameters and exploring charge distribution rules in structure-based drug design. J. Chem. Theory Comput. 2009, 5, 1680– 1691.

(14) Wang, J.; Wolf, R. M.; Caldwell, J. W.; Kollman, P. A.; Case, D. A. Development and testing of a general amber force field. J. Comput. Chem. 2004, 25, 1157–1174.

(15) Zoete, V.; Cuendet, M. A.; Grosdidier, A.; Michielin, O. SwissParam: a fast force field generation tool for small organic molecules. J. Comput. Chem. 2011, 32, 2359–2368.

8

CHAPTER TWO:

DEVELOPMENT OF THE CHARMMING WEB USER INTERFACE AS A PLATFORM

FOR COMPUTER-AIDED DRUG DESIGN

Note to Reader

Reprinted (adapted) with permission from Yuri Pevzner †, Emilie Frugier ‡, Vinushka

Schalk †§, Amedeo Caflisch ‡, and H. Lee Woodcock *† Journal of Chemical Information and

1Modeling 2014, 54, 2612-2620 Copyright 2014 American Chemical Society

The publication follows; supporting information can be found in Appendix A.

CHARMM Interface and Graphics Platform

The CHARMM interface and graphics (CHARMMing)1 is an open source Web interface to the popular macromolecular modeling package CHARMM.2,3 The goal of the CHARMMing project is to provide a platform-independent Web-based front-end that allows its users to set up and perform a wide variety of molecular modeling tasks. At its inception CHARMMing project was designed to satisfy some of the unmet needs of the drug discovery community as well as address a number of the implementation challenges mentioned in chapter one. CHARMMing is built using Django4 framework which is based on a widely used Python5 scripting language.

† Department of Chemistry, University of South Florida, 4202 E. Fowler Ave., CHE205, Tampa, Florida 33620-5250, United States ‡ Department of Biochemistry, University of Zurich, CH-8057 Zurich, Switzerland § Department of Natural Sciences, New College of Florida, Sarasota, Florida 34243, United States

9

Python is a high level language that is relatively easy to learn even for somebody with limited exposure to programming while at the same time is adherent to the more advanced object- oriented programming principles. At the same time Django uses a consistent model-view paradigm that facilitates consistency of the code structure and makes it easy to add and effectively encapsulate new features, modules and functionality. Although Django uses a relational database (MySQL6 in the context of this project) as a back-end storage engine, database access is facilitated via an intuitive ORM (object-relational mapping) API (application program interface) that allows database operations to be performed even by developers with no knowledge of SQL (Structured Query Language) that is natively used to operate on relational . The use of above technologies makes contribution to CHARMMing project possible from a variety of platforms. Moreover the end product of the project is a web-based application.

This means that the end-user functionality is served via a web browser, thus eliminating the issues of cross-platform compatibility as modern browsers are able to serve content rather consistently over a wide range of computer architectures, operating systems and devices.

CHARMMing’s basic functionality includes structure processing, setting up and running molecular dynamics simulation and other molecular modeling tasks. It directly interfaces with

PDB (Protein Data Bank)7 to retrieve structures and includes an internal parser and structure processing mechanism to prepares biological systems of interest for molecular modeling studies.

CHARMMing’s infrastructure includes PBS (Portable Batch System)8 for scheduling of jobs submitted by multiple users and the jobs’ distribution over available HPC resources.

CHARMMing’s users range from small academic labs, that benefit from the portal’s functionality, to educators, that include molecular modeling in their curricula and use the portal to facilitate their teaching.9–11

10

CHARMMing is open source and the code can be downloaded for free. In addition

CHARMMing comes with an automated installer that can be used to install an instance of

CHARMMing infrastructure and the interface on a Linux based architecture. This is particularly useful when developing a custom and possibly proprietary application using CHARMMing with all or some of its functionality as a starting point. The development of the VTS (Virtual Target

Screening)12 server described later in this work is an example of such use of CHARMMing. The remainder of this chapter describes the development and evaluation of fragment-based docking protocol as an extension of CHARMMing functionality.

Fragment-based docking protocol in CHARMMing

Implementation Details

Target Preparation

Target proteins begin their preparation via CHARMMing’s structure submission section.

Here, tasks such as the addition of hydrogens, identification of any non-protein moieties, and assignment of final parameters are carried out (using the latest, CHARMM36 protein force field).13,14 Co-crystallized small molecules (i.e., ligands) are automatically parameterized using the CGenFF.15 Specifically, ligand atom-typing and parameterization is performed by sequentially attempting several automated parameterization tools. The default order is: (1)

Param-Chem,15–17 (2) MATCH,18 (3) Antechamber,19 and (4) GENRTF.20 As an alternative to the default order a user can specify the exact build procedure to use for parameterization.

Compound Library and Ligand Upload

CHARMMing docking module provides a pre-loaded library of drug-like compounds for

11

virtual screening experiments. The library consists of approximately 8,000 molecules from the

Maybridge HitfinderTM set (www.maybridge.com). All of the provided molecules have been atom typed according to CGenFF convention to comply with CHARMM requirements and confirmed to decompose into at least 3 sufficiently sized fragments to meet the fragment-based docking criteria. CHARMMing also allows users to upload ligands by providing a coordinate file in mol2 format. Upon uploading, the ligand undergoes atom-typing and parameterization as previously described. The ligand and corresponding parameter, topology, and structure files are then saved on disk as well as cataloged in the database. The parameters for pre-loaded and user- uploaded ligands contain penalty scores that reflect the quality of the bonded parameters and partial charges.17 These penalties should be used by the user to make decisions regarding compounds that should be discarded due to the lack of quality parameters. Unlike the pre-loaded compound library, any user-uploaded ligands are restricted to their account only and are not visible to other users. The user is also given the ability to create custom sets of molecules based on any pre-loaded or user uploaded compounds. This can be done via the “Ligand Sets” section

(Figure 1) of the docking module. Any custom or pre-loaded set can be docked in its entirety or by selecting individual molecules on the docking submission page (Figure 2).

Binding Site Definition

To provide maximum flexibility with respect to job setup, two different ways of specifying the binding region of interest are implemented. The first approach identifies the binding pocket using the position of a co-crystallized ligand that may be present. In this case, when launching a docking job a user is presented with a list of all co-crystallized small

12

Figure 1. The “Ligand Set Details” page allows the user to manage custom ligand sets. The user can define and describe a custom ligand set as well as add ligands to it from any of the other sets including the pre-loaded public library.

molecules along with their 2D structural representations. Once the desired small molecule is chosen, the binding site is defined via proximity to the aforementioned small molecule. In cases where no co-crystallized ligand is present, or if a user simply wishes to investigate alternative binding sites, we have implemented an interactive and graphical binding site definition tool

(Figure 3). To use this tool, two residues should be selected that roughly correspond to the edges

13

Figure 2. The “Submit Docking Job” page. This page presents the user with the ability to select the target coordinates for docking, define the binding pocket (vide infra), and select ligands to dock from the list of available small molecules. The native ligands and ligands available for docking can be visualized in 3D using the embedded visualization application. of the desired binding region. The midpoint between these residues is then determined and defined as the approximate center of the binding site. Based on a user-defined radius, a list of all residues within this distance is compiled and both visually highlighted and presented as a list.

The user can then add or remove residues to/from this list by either modifying the text of the residue list, changing the specified search radius, or modifying it via graphical selection (i.e., clicking). Ultimately, all user-defined binding sites are saved and presented as options, with any existing co-crystallized ligands, at the docking job setup page.

14

Figure 3. The binding site definition page. This page provides the user with multiple ways to select a custom binding site. This can be done either by manually typing in the residue numbers, graphically selecting residues, or defining the centroid and specifying the radius in Å.

Docking Protocol

Docking algorithms used in this protocol are based on the popular grid-based paradigm, used by most current docking programs.21–26 In this approach the solvent accessible surface area of the target and the ligand as well as the target’s binding site are discretized onto a 3D lattice.

The lattice then either stores information about the atoms enclosed by a cubic unit of the grid or contains the potential contributions projected onto the grid’s vertices. Pre-computed grids allow

15

for efficient calculation of both van der Waals and electrostatic contributions to the scoring function, facilitating rapid evaluation of ligand placements within the binding site.

The docking procedure consists of several steps where different programs perform distinct tasks. To streamline the communication between the programs and ensure compatibility of input and output data a series of scripts were written in Python, Perl, and Linux shell scripting languages. The OpenBabel27 file conversion utility was used to inter-convert between different representations of the protein and compound structures. The program MATCH30 was used to generate CGenFF compatible topologies and parameters. The fragment-based docking protocol implemented in CHARMMing is outlined in Figure 4 and described below:

1. Each compound to be docked is first broken down into fragments. A fingerprint

describing chemical richness is generated for each fragment and its parent compound.

The three most chemically rich, but not necessarily different, fragments are identified to

serve as anchors for docking. These steps are carried out by the program DAIM

(Decomposition and Identification of Molecules).28

2. The user then identifies the binding site to be used in the docking job. All non-

protein, non-solvent compounds present in the submitted target structure are displayed on

the “Submit Docking Job” page (Figure 2). Based on the user selected compound, the

proximal residues are identified and the binding site defined.

3. The previously identified anchor fragments (step 1) are then docked into the

binding site using the program SEED (Solvation Energy for Exhaustive Docking).29 The

placement of fragments within the binding site is determined by matching either the

direction of polar vectors between ligand and receptor atoms to form a hydrogen bond or

the apolar vectors on the solvent accessible surface area of the ligand the receptor. The

16

A

B

Figure 4. Schematic depiction of fragment-based docking protocol implemented into CHARMMing. A- The main stages of the docking: decomposition by DAIM, fragment docking by SEED, and ligand placement by FFLD. B – Flow chart representing the sequence of the procedures of the protocol. CHARMM(ing) world grayed area represents the steps which include CHARMM compatible protein-ligand system.

SEED score, used in fragment placement, accounts for the solvent effects by including terms for both receptor and fragment desolvation as well as a solvent screened receptor- fragment electrostatic interaction term. 17

4. The docked fragments are reconnected into the original ligand while undergoing refinement using the FFLD (Fragment-based Flexible Ligand Docking) program.30 FFLD uses a genetic algorithm that generates and evaluates populations of conformations and positions them within the binding site, as guided by fragment anchor locations. The fitness of a placed con- formation is evaluated using a scoring function that is aimed at approximating the steric effects as well as hydrogen bonding contributions of the protein- ligand interactions. This function includes intra-ligand and protein-ligand van der Waals interaction terms as well as polar contributions based on the number of hydrogen bonds and unfavorable donor-donor and acceptor-acceptor interactions.

5. Poses generated by FFLD that are within a user defined energy cutoff (10 kcal/mol by default) are then clustered using a leader clustering algorithm implemented in the program FLEA (FFLD Leader Clustering).31

6. Following the clustering, the protein-ligand complex is converted to native

CHARMM format and saved. Using these files, in addition to the CHARMM protein and generalized force fields (i.e., CHARMM36 and CGenFF), protein structure (psf) and coordinate (crd) files are generated. Each ligand then undergoes one thousand steps of minimization using the adopted basis Newton-Raphson (ABNR) algorithm while keeping protein atoms fixed. The “minimized” protein-ligand complexes are then scored using

SEED and FFLD in their “evaluation only” mode, producing their own estimation of electrostatic, van der Waals, and total energy contributions for each pose. The final ranking of the docked poses is performed using a consensus approach. For this, energies

(i.e., interaction energy from CHARMM and total energies from SEED and FFLD) are

18

used to create three lists in which individual poses are sorted and ranked. The final rank

of each pose is then set to

Job Submission and Monitoring

When a docking job is launched, the based queuing system TORQUE51 accepts the job as a wrapper shell script that controls the entire docking procedure. Using the interface, a job can be monitored in real-time as it progresses and generates final poses for each docked compound.

Basic job statistics such as submission time and job status can be monitored along with the output file reflecting the job progression (Figure 5). In addition, important files associated with job progress and results (e.g., final docked ligand poses, job output, etc.) can be downloaded to a local disk. Protein, ligands, compounds in the library, and final docked poses can all be visualized directly in CHARMMing. The 3D structure of each of the above elements can be rendered with the JSmol38 or GLmol39 visualization tools. Structures can be visualized using a variety of representations to highlight important structural features or interactions of the molecules and their complexes.

A walk-through outlining the entire process of performing a self-dock on a sample system is included in the tutorial covering basic CHARMM and CHARMMing functionality at www.charmmtutorial.org. Additionally a docking lesson that guides a user through the self- docking procedure has been added to the lessons section of the CHARMMing website.

Performance and Local Execution

Currently, all docking jobs executed via the Web interface are carried out sequentially.

19

Figure 5. The “Job Details” page. This page provides general job information as well as the list of docked poses and their respective scores. The docked poses can be visualized in 3D within the binding pocket of the protein using the embedded visualization application. An archive of the job directory can also be downloaded from this page for execution on local resources.

However, after the initial setup of the docking job, all necessary files are available for download and execution on local computational resources. To improve performance of this procedure, we have developed a protocol that can be carried out in parallel as outlined in Figure 6. This is achieved by spawning a new execution branch for each of the most time consuming steps in the protocol via a user-modifiable job queuing command. For example, each fragment of each molecule is docked (step 3, vide supra) as a separate submitted job. Once all of a molecule’s

20

anchor fragments are docked, the placement of a ligand within the binding site by FFLD is also spawned as a series of separate jobs. Furthermore, to increase sampling by FFLD, and improve performance, the protocol performs multiple docking iterations per ligand, again each as a separate job. Thus, instead of one docking job that attempts to sequentially sample a large conformational space per ligand, multiple shorter iterations with different random seeds are run in parallel, taking less real time and still sufficiently sampling ligand conformational space. The number of iterations per ligand as well as the amount of energy evaluations per iteration are all user modifiable parameters.

In order to execute a job on local resources the following programs need to be downloaded and installed: VMD,40 DAIM, SEED, FFLD, FLEA, MATCH, and CHARMM.

Except for CHARMM, all of these programs are free for academic use. VMD can be downloaded from the University of Illinois at Urbana-Champaign’s Theoretical and

Computational Biophysics group www.ks.uiuc.edu/Research/vmd). DAIM, SEED, FFLD, and

FLEA can be obtained from the University of Zurich’s Computational Structural Biology lab

(www.biochem-caflisch uzh.ch/download). Further, a more general description of the installation process is included as part of the CHARMM tutorial and can be found at the following address: www.charmmtutorial.org/index.php/Installation_of_CHARMMing.

Once the job directory is downloaded and the software is installed on local resources, the provided settings file should be used to specify the location of program executables. In addition, job details (e.g., protein file name, number of docking iterations, clustering energy cutoff, etc.) can be modified via the settings file. This file is also where PBS/TORQUE commands can be modified for local resources. Because there is no limit to the number of possible parallel processes spawned, the protocol checks for available resources and will wait for current

21

Figure 6. Parallelization of the docking protocol. The parallelization is achieved by spawning new job execution threads at both the fragment docking (i.e., one per fragment) and ligand placement (i.e., one per iteration per ligand) steps. Clustering and scoring threads are also spawned for each docked ligand.

processes to complete if the queue is full. The protocol will automatically take advantage of all available resources to speed up job completion while at the same time adhering to the local queuing system policies.

Results and Discussion

To assess the performance of the docking protocol a diversity set was constructed from the publicly available CCDC/Astex test set,41 containing high resolution x-ray complexes, and an

22

augmented version of that set, which has been used to compare the performance of a number of docking programs.42 Our final set contained 24 protein - ligand complexes with x-ray resolutions ranging from 1.50Å – 2.30Å. In particular, we selected complexes where the ligand could be decomposed into three fragments (i.e., at least three rotatable bonds) using the default settings of

DAIM; as the ultimate goal was to evaluate the implementation of the decomposition based approach.

Self-dock validation involved removing the co-crystallized ligand from the complex, self- docking it via the fragment-based protocol, and comparing the docked pose to that of the original crystal structure. Each complex was processed using CHARMMing’s “Submit Structure” section that downloads the structure based on the PDB code, adds hydrogen atoms, and prepares the structure for modeling using CHARMM. Further, each system containing the protein, solvent and ligand molecules was briefly minimized for 100 steps using the Steepest Descent method followed by 1000 steps of ABNR using CHARMMing’s “Calculations” module. Using the

“Ligand Upload” section of CHARMMing’s docking module the previously downloaded ligand was processed. The docking calculation for each minimized system was set up by selecting a native ligand to define a binding pocket and user-uploaded ligand for docking, all from the

“Submit Docking Job” page of the docking module. The progress of each job was monitored using the job monitoring section of the docking module. To assess the performance of the dockings, root-mean-square deviation (RMSD) between the heavy atoms of the docked poses and the crystal structures was calculated using VMD.

To compare the docking protocol’s performance, a commercially available docking package was also used. Self-dockings were performed using Schrödinger’s Glide22–24,43 Standard

Precision (SP) docking protocol. Glide’s SP protocol attempts to dock multiple conformations of

23

a ligand into a receptor grid, subsequently calculating the effective ligand-receptor interactions using a proprietary scoring function. Conformational sampling of the ligand is achieved via varying torsion angles around rotatable bonds. Prior to docking, each target was prepared using

Maestro’s44 Protein Preparation Wizard.45–49 The preparation included removal of solvent molecules, addition of hydrogens, and brief minimization. As Glide is also a grid based docking protocol, the grids, similarly to CHARMMing’s procedure, were built using the co-crystal ligand to define the binding region. The native ligand was removed and self-docked using default parameters of the SP docking protocol. The poses with the best docking scores were used to calculate their respective RMSD from the crystal structure using VMD.

Table 1 reports the RMSD of poses generated by CHARMMing’s fragment-based docking protocol and Glide SP docking (w.r.t. crystal structure). Results reported from

CHARMMing’s fragment-based docking protocol correspond to the pose closest to the crystal structure. This set yields a 71% success rate using RMSD <2.0 Å as the metric; this criteria is commonly employed for evaluating the performance of docking algorithms.42,50–52 This clearly shows that the protocol can successfully recover the crystal pose in the majority of the cases.

Optimization of the consensus scoring function is planned as part of the future improvements to the protocol. Nevertheless, virtual screening is known to suffer from high false-positives rate, which does not diminish its value in drug discovery as the unfit compounds are screened out during the experimental stages of the discovery campaigns.53 Regardless, we are encouraged by the success of fragment-based docking, which shows approximately the same performance as widely used docking programs, i.e., within the range 40% – 90%.42,50–52

The fragment-based approach that was implemented into CHARMMing yields a substantial amount of information about the characteristics of each docked pose. At each

24

Table 1. RMSDs of the docking poses generated by CHARMMing’s fragment-based docking protocol and Glide SP and the success rates (defined by the percentage of the ligands whose reported RMSD is below 2.0Å). “Best RMSD” refers to the pose closest to the crystal structure. Glide SP RMSD is of the top scoring pose of Glide’s standard precision docking.

step, from decomposition to minimization of docked poses, users have the ability to closely analyze results. The binding modes of each individual fragment can be inspected and a number of modifiable parameters, such as decomposition criteria, can be used to optimize the protocol.

Moreover, information gained from docking a fragment library into a particular target can be used to mine large libraries for compounds containing those fragments that form the most favorable interactions with the target.54–56

There are potentially a number of improvements that can be made to improve the performance and usability of CHARMMing’s docking protocol. The most obvious limitation is the current requirement of three fragments to be used as anchors. As can be seen by the number

25

of ligands eliminated from the original benchmarking set, this limits the applicability of this protocol in its current form to medium to large sized molecules with a sufficient number of rotatable bonds. Although partially this problem can be alleviated by decreasing the fragment richness threshold at the decomposition step, this will only increase the “eligibility” rate of molecules by a small margin. Alternatively, when docking these small and/or rigid molecules is desired, the decomposition step could be omitted at which point the molecules would undergo docking only by SEED. This however will require prior conformation sampling step as SEED currently does not sample the internal conformation of docked fragments. The conformational sampling of the fragments is an obvious improvement to the docking protocol even in its current state. This addition will help ensure that larger fragments sample their orientations within the binding site while varying their internal geometry, thus ensuring greater enrichment of anchor positions for the final ligand placement. Efforts to incorporate these functionality improvements are currently underway.

Conclusions

This work has described the implementation of a fragment-based docking protocol into the CHARMMing Web interface. The protocol allows users to perform docking and virtual screening calculations online as well as generates self-contained scripts to execute these in parallel on local HPC resources. The performance of the docking protocol was evaluated by carrying out the series of self-dockings and comparing the results against a top commercial docking package. The fragment-based docking protocol yielded results comparable to both the commercial package used herein and a wide variety of additional docking software. Specifically,

26

the rate of recovering the correct x-ray pose with CHARMMing’s protocol was 71%, well within the 40% – 90% range that numerous benchmarking studies have reported.

While the scoring function that CHARMMing uses to rank poses can still be improved, the tool lays substantial ground work for allowing academic labs to set up and perform molecular docking and virtual screening studies. It is important to note that the protocol is able to create

CHARMM formatted protein-ligand systems giving users the ability to access the wide range of functionality that exists in CHARMM. For example, docked poses can easily be refined with MD simulations and pre-docked proteins can be coupled with simulations or normal mode analysis to proceed via an ensemble docking approach. These, in addition to other improvements are currently being developed.

Acknowledgments

HLW3 would like to acknowledge NIH (1K22HL088341-01A1), DOE (DE-

SC0011297TDD), NSF (CHE1156853), and the University of South Florida (start-up) for funding. Computations were performed at the USF Research Computing Center. The authors thank Dr. Sandra Rennebaum for her assistance with troubleshooting docking programs used in the protocols, Dr. Peter Kolb for help with DAIM, Prof. Alex MacKerell and Dr. Kenno

Vanommeslaeghe for assistance with small molecule atom-typing and access to the Maybridge library. Finally, we especially thank Mr. Tim Miller, Dr. Bernard Brooks, and the National

Heart, Lung, and Blood Institute of the NIH for support and collaboration.

27

List of References

(1) Miller, B. T.; Singh, R. P.; Klauda, J. B.; Hodoscek, M.; Brooks, B. R.; Woodcock, H. L. CHARMMing: a new, flexible web portal for CHARMM. J. Chem. Inf. Model. 2008, 48, 1920–1929.

(2) Brooks, B. R.; Brooks, C. L.; Mackerell, A. D.; Nilsson, L.; Petrella, R. J.; Roux, B.; Won, Y.; Archontis, G.; Bartels, C.; Boresch, S.; Caflisch, A.; Caves, L.; Cui, Q.; Dinner, A. R.; Feig, M.; Fischer, S.; Gao, J.; Hodoscek, M.; Im, W.; Kuczera, K.; Lazaridis, T.; Ma, J.; Ovchinnikov, V.; Paci, E.; Pastor, R. W.; Post, C. B.; Pu, J. Z.; Schaefer, M.; Tidor, B.; Venable, R. M.; Woodcock, H. L.; Wu, X.; Yang, W.; York, D. M.; Karplus, M. CHARMM: the biomolecular simulation program. J. Comput. Chem. 2009, 30, 1545– 1614.

(3) Brooks, B. R.; Bruccoleri, R. E.; Olafson, B. D.; States, D. J.; Swaminathan, S.; Karplus, M. CHARMM: A program for macromolecular energy, minimization, and dynamics calculations. J. Comput. Chem. 1983, 4, 187–217.

(4) Lawrence Journal-World. Django. Django http://www.djangoproject.com (accessed May 30, 2008).

(5) Python Software Foundation. Python. Python.

(6) Oracle Corporation. MySQL. MySQL www.mysql.com (accessed May 15, 2014).

(7) Berman, H. M. The Protein Data Bank. Nucleic Acids Res. 2000, 28, 235–242.

(8) OpenPBS. OpenPBS. OpenPBS www.openpbs.org (accessed May 30, 2008).

(9) Miller, B. T.; Singh, R. P.; Schalk, V.; Pevzner, Y.; Sun, J.; Miller, C. S.; Boresch, S.; Ichiye, T.; Brooks, B. R.; Woodcock, H. L. Web-Based Computational Chemistry Education with CHARMMing I: Lessons and Tutorial. PLoS Comput. Biol. 2014, 10, e1003719.

(10) Pickard, F. C.; Miller, B. T.; Schalk, V.; Lerner, M. G.; Woodcock, H. L.; Brooks, B. R. Web-based computational chemistry education with CHARMMing II: Coarse-grained protein folding. PLoS Comput. Biol. 2014, 10, e1003738.

(11) Perrin, B. S.; Miller, B. T.; Schalk, V.; Woodcock, H. L.; Brooks, B. R.; Ichiye, T. Web- based computational chemistry education with CHARMMing III: Reduction potentials of electron transfer proteins. PLoS Comput. Biol. 2014, 10, e1003739.

(12) Santiago, D. N.; Pevzner, Y.; Durand, A. A.; Tran, M.; Scheerer, R. R.; Daniel, K.; Sung, S.-S.; Woodcock, H. L.; Guida, W. C.; Brooks, W. H. Virtual target screening: validation using kinase inhibitors. J. Chem. Inf. Model. 2012, 52, 2192–2203. 28

(13) MacKerell, A. D.; Bashford, D.; Dunbrack, R. L.; Evanseck, J. D.; Field, M. J.; Fischer, S.; Gao, J.; Guo, H.; Ha, S.; Joseph-McCarthy, D.; Kuchnir, L.; Kuczera, K.; Lau, F. T. K.; Mattos, C.; Michnick, S.; Ngo, T.; Nguyen, D. T.; Prodhom, B.; Reiher, W. E.; Roux, B.; Schlenkrich, M.; Smith, J. C.; Stote, R.; Straub, J.; Watanabe, M.; Wiórkiewicz- Kuczera, J.; Yin, D.; Karplus, M. All-Atom Empirical Potential for Molecular Modeling and Dynamics Studies of Proteins. J. Phys. Chem. B 1998, 102, 3586–3616.

(14) Mackerell, A. D.; Feig, M.; Brooks, C. L. Extending the treatment of backbone energetics in protein force fields: limitations of gas-phase quantum mechanics in reproducing protein conformational distributions in molecular dynamics simulations. J. Comput. Chem. 2004, 25, 1400–1415.

(15) Vanommeslaeghe, K.; Hatcher, E.; Acharya, C.; Kundu, S.; Zhong, S.; Shim, J.; Darian, E.; Guvench, O.; Lopes, P.; Vorobyov, I.; Mackerell, A. D. CHARMM general force field: A force field for drug-like molecules compatible with the CHARMM all-atom additive biological force fields. J. Comput. Chem. 2010, 31, 671–690.

(16) Vanommeslaeghe, K.; MacKerell, A. D. Automation of the CHARMM General Force Field (CGenFF) I: bond perception and atom typing. J. Chem. Inf. Model. 2012, 52, 3144– 3154.

(17) Vanommeslaeghe, K.; Raman, E. P.; MacKerell, A. D. Automation of the CHARMM General Force Field (CGenFF) II: assignment of bonded parameters and partial atomic charges. J. Chem. Inf. Model. 2012, 52, 3155–3168.

(18) Yesselman, J. D.; Price, D. J.; Knight, J. L.; Brooks, C. L. MATCH: an atom-typing toolset for molecular mechanics force fields. J. Comput. Chem. 2012, 33, 189–202.

(19) Wang, J.; Wang, W.; Kollman, P. A.; Case, D. A. Automatic atom type and bond type perception in molecular mechanical calculations. J. Mol. Graph. Model. 2006, 25, 247– 260.

(20) Hodoscek, M. Genrtf is a program to convert cartesian coordinates into input files. Genrtf is a program to convert cartesian coordinates into charmm input files http://code.google.com/p/genrtf/.

(21) Meng, E. C.; Shoichet, B. K.; Kuntz, I. D. Automated docking with grid-based energy evaluation. J. Comput. Chem. 1992, 13, 505–524.

(22) Friesner, R. A.; Murphy, R. B.; Repasky, M. P.; Frye, L. L.; Greenwood, J. R.; Halgren, T. A.; Sanschagrin, P. C.; Mainz, D. T. Extra precision glide: docking and scoring incorporating a model of hydrophobic enclosure for protein-ligand complexes. J. Med. Chem. 2006, 49, 6177–6196.

29

(23) Halgren, T. A.; Murphy, R. B.; Friesner, R. A.; Beard, H. S.; Frye, L. L.; Pollard, W. T.; Banks, J. L. Glide: a new approach for rapid, accurate docking and scoring. 2. Enrichment factors in database screening. J. Med. Chem. 2004, 47, 1750–1759.

(24) Friesner, R. A.; Banks, J. L.; Murphy, R. B.; Halgren, T. A.; Klicic, J. J.; Mainz, D. T.; Repasky, M. P.; Knoll, E. H.; Shelley, M.; Perry, J. K.; Shaw, D. E.; Francis, P.; Shenkin, P. S. Glide: a new approach for rapid, accurate docking and scoring. 1. Method and assessment of docking accuracy. J. Med. Chem. 2004, 47, 1739–1749.

(25) Grosdidier, A.; Zoete, V.; Michielin, O. Fast docking using the CHARMM force field with EADock DSS. J. Comput. Chem. 2011.

(26) Wu, G.; Robertson, D. H.; Brooks, C. L.; Vieth, M. Detailed analysis of grid-based molecular docking: A case study of CDOCKER-A CHARMm-based MD docking algorithm. J. Comput. Chem. 2003, 24, 1549–1562.

(27) O’Boyle, N. M.; Banck, M.; James, C. A.; Morley, C.; Vandermeersch, T.; Hutchison, G. R. Open Babel: An open chemical toolbox. J. Cheminform. 2011, 3, 33.

(28) Kolb, P.; Caflisch, A. Automatic and efficient decomposition of two-dimensional structures of small molecules for fragment-based high-throughput docking. J. Med. Chem. 2006, 49, 7384–7392.

(29) Majeux, N.; Scarsi, M.; Caflisch, A. Efficient electrostatic solvation model for protein- fragment docking. Proteins 2001, 42, 256–268.

(30) Budin, N.; Majeux, N.; Caflisch, A. Fragment-Based flexible ligand docking by evolutionary optimization. Biol. Chem. 2001, 382, 1365–1372.

(31) Caflisch, F. D. and A. FLEA - FFLD Lead Clustering. Unpublished.

(32) Charifson, P. S.; Corkery, J. J.; Murcko, M. A.; Walters, W. P. Consensus Scoring: A Method for Obtaining Improved Hit Rates from Docking Databases of Three-Dimensional Structures into Proteins. J. Med. Chem. 1999, 42, 5100–5109.

(33) Ece, A.; Sevin, F. The discovery of potential cyclin A/CDK2 inhibitors: a combination of 3D QSAR pharmacophore modeling, virtual screening, and molecular docking studies. Med. Chem. Res. 2013, 22, 5832–5843.

(34) Houston, D. R.; Walkinshaw, M. D. Consensus docking: improving the reliability of docking in a virtual screening context. J. Chem. Inf. Model. 2013, 53, 384–390.

(35) Ji, X.; Zheng, Y.; Wang, W.; Sheng, J.; Hao, J.; Sun, M. Virtual screening of novel reversible inhibitors for marine alkaline protease MP. J. Mol. Graph. Model. 2013, 46, 125–131.

30

(36) Klon, A. E.; Glick, M.; Davies, J. W. Combination of a naive Bayes classifier with consensus scoring improves enrichment of high-throughput docking results. J. Med. Chem. 2004, 47, 4356–4359.

(37) Friedman, R.; Caflisch, A. Discovery of plasmepsin inhibitors by fragment-based docking and consensus scoring. ChemMedChem 2009, 4, 1317–1326.

(38) Hanson, R. M.; Prilusky, J.; Renjian, Z.; Nakane, T.; Sussman, J. L. JSmol and the Next- Generation Web-Based Representation of 3D Molecular Structure as Applied to Proteopedia. Isr. J. Chem. 2013, 53, 207–216.

(39) Nakane, T. Glmol - molecular viewer on webgl/, version 0.47. Glmol - molecular viewer on webgl/javascript, version 0.47 http://webglmol.sourceforge.jp/index- en.html. (accessed May 15, 2014).

(40) Humphrey, W.; Dalke, A.; Schulten, K. VMD: Visual molecular dynamics. J. Mol. Graph. 1996, 14, 33–38.

(41) Nissink, J. W. M.; Murray, C.; Hartshorn, M.; Verdonk, M. L.; Cole, J. C.; Taylor, R. A new test set for validating predictions of protein-ligand interaction. Proteins 2002, 49, 457–471.

(42) Cross, J. B.; Thompson, D. C.; Rai, B. K.; Baber, J. C.; Fan, K. Y.; Hu, Y.; Humblet, C. Comparison of several molecular docking programs: pose prediction and virtual screening accuracy. J. Chem. Inf. Model. 2009, 49, 1455–1474.

(43) Schrödinger, L. Small-Molecule Drug Discovery Suite 2014-1: Glide. Small-Molecule Drug Discovery Suite 2014-1: Glide, 2014.

(44) Schrödinger, L. Maestro. Maestro, 2014.

(45) Schrödinger Suite 2013 Protein Preparation Wizard. Schrödinger Suite 2013 Protein Preparation Wizard.

(46) Schrödinger, L. Epik. Epik, 2013.

(47) Schrödinger, L. Prime. Prime, 2014.

(48) Schrödinger, L. Impact. Impact, 2013.

(49) Sastry, G. M.; Adzhigirey, M.; Day, T.; Annabhimoju, R.; Sherman, W. Protein and ligand preparation: parameters, protocols, and influence on virtual screening enrichments. J. Comput. Aided. Mol. Des. 2013, 27, 221–234.

(50) Perola, E.; Walters, W. P.; Charifson, P. S. A detailed comparison of current docking and scoring methods on systems of pharmaceutical relevance. Proteins 2004, 56, 235–249. 31

(51) Srivastava, H. K.; Chourasia, M.; Kumar, D.; Sastry, G. N. Comparison of computational methods to model DNA minor groove binders. J. Chem. Inf. Model. 2011, 51, 558–571.

(52) Neale, D. S.; Thompson, P. E.; White, P. J.; Chalmers, D. K.; Yuriev, E.; Manallack, D. T. Binding Mode Prediction of PDE4 Inhibitors: A Comparison of Modelling Methods. Aust. J. Chem. 2010, 63, 396.

(53) Shoichet, B. K. Virtual screening of chemical libraries. Nature 2004, 432, 862–865.

(54) Zhao, H.; Gartenmann, L.; Dong, J.; Spiliotopoulos, D.; Caflisch, A. Discovery of BRD4 bromodomain inhibitors by fragment-based high-throughput docking. Bioorg. Med. Chem. Lett. 2014, 24, 2493–2496.

(55) Zhao, H.; Dong, J.; Lafleur, K.; Nevado, C.; Caflisch, A. Discovery of a Novel Chemotype of Tyrosine Kinase Inhibitors by Fragment-Based Docking and Molecular Dynamics. ACS Med. Chem. Lett. 2012, 3, 834–838.

(56) Kolb, P.; Kipouros, C. B.; Huang, D.; Caflisch, A. Structure-based tailoring of compound libraries for high-throughput screening: discovery of novel EphB4 kinase inhibitors. Proteins 2008, 73, 11–18.

32

CHAPTER THREE:

DEVELOPMENT AND APPLICATION OF THE VIRTUAL TARGET SCREENING

PROTOCOL

Development of the Virtual Target Screening Server

Methodology Background

Active development and, in large part, acceptance of docking algorithms and software in recent years has allowed the concept of virtual screening gain the status of a widely used tool during early stages of drug discovery.1–5 Virtual screening (VS) involves docking of a set of small molecules into a binding site of a target of interest represented by a computer model.

Depending on the availability of computational resources and the nature of the docking algorithm used, rather large (millions of compounds) libraries of molecules can be docked in such manner. VS represents a quick and less thorough and less expensive sweep of the chemistry space than most other computational or experimental techniques. This allows researchers to start with a broader representation of the chemistry space and narrow down the list of potential drug candidates committing additional computational and experimental resources to a more focused set of compounds further down the discovery “funnel”.1–5

The idea of virtual target screening (VTS) is based on the same molecular docking principles, however is applied in reverse, hence its other common names - inverse docking or virtual counter - screening. VTS involves docking of a single molecule of interest (MOI) into a set or a library of proteins. Thus, VTS’s applicability in the context of a discovery project becomes rather different from that of the conventional VS as VTS is applied at later stages of

33

drug discovery. The MOI, as its name suggests, is usually a chemical substance of particular interest. MOI may be a hit that resulted from a VS campaign, or most commonly it is a lead or a drug candidate. In the latter case VTS can help gain insight into promiscuity of the lead compound or its selectivity towards a particular target. VTS, due to its very nature, is an approach to take when drug repurposing or repositioning is sought. In this case the MOI is often an already existing approved drug. However MOI can also be a drug candidate that failed at a later stage of development for a particular indication and an alternative mode of action is being probed.

VTS is an approach that is beneficial at multiple stages of drug discovery and has the potential to save substantial costs and efforts by identifying problematic compounds at earlier stages of the discovery. It can also help find new directions in the development of a pharmaceutical agent, gain important insight into the relationships between different ligand and protein families and help identify biological targets of known chemical substances. The remainder of this chapter focuses on the implementation of the VTS server on a private network of H. Lee Moffitt Cancer Center using CHARMMing open source framework as its base. In addition, two examples of applicability of VTS protocol in the context of drug discovery are presented. First example focuses on identifying potential targets for natural products, while the second case describes the application of VTS to search for biomolecular targets of vitamin E δ- tocotrienol.

VTS Protocol

There exist a number of systems based on the concept of virtual target screening.6–9

Although the main goal remains essentially the same, different approaches have been taken to

34

arrive at it. Some implementations compare MOI to known actives, some put emphasis on the comparison of binding sites. While these approaches are useful and even successful each in their own way, the problems still remain. Lack of known actives with sufficient affinities for a particular target, insufficient amount of structural information to thoroughly define special descriptors of binding sites all present challenges.

VTS protocol implemented at the Moffitt Cancer Center uses a procedure developed by

Santiago et.al.10 The procedure includes a calibration step for each protein in the database. The calibration is intended to establish standards that an MOI should meet in order to be interpreted as a “hit” according to the protocol. The calibration involves docking of the National Cancer

Institute (NCI) Diversity Set I2 consisting of structures of 1990 drug-like compounds and recording the average docking scores of top 20 and top 200 scoring compounds as well as overall average and Boltzmann weighted average. The calibration is performed on each of the 1,567 protein structures in the VTS database. Upon docking of MOI, the docking score of the MOI is compared to the top 20 average for that protein and if MOI scores better it is considered as “hit”.

If a screen results in a small number of hits, the criteria can be expanded to the top 200 average.

VTS Web-based interface

To facilitate the application of the VTS protocol, a web-based interface that provides a user friendly, quick and automated tool for docking MOIs into collections of user-defined proteins was developed. The framework for the online VTS interface is based on the open

CHARMMing package. Publicly available installation of CHARMMing framework was used to deploy it on a server that was a part the private network of Moffitt Cancer Center. The

35

installation has provided basic out of the box components that included database storage engine, user management system, batch job scheduling infrastructure and a user interface with basic structure processing and job submission and monitoring functionality. The graphical Virtual

Target Screening (gVTS) system includes tools necessary to set up and initiate VTS experiments.

Functionality implemented in the context of this project includes the following:

 Maintain a library of protein grids for docking of small molecules.

o User-prepared grids, based on the proteins of interest can be uploaded and stored

in the internal database.

o Ability to create custom grid sets that can represent structures specific to a given

VTS experiment.

 Maintain database of MOIs.

o User can submit MOIs either by uploading the Cartesian coordinates or by

drawing a molecule via a 2D chemical drawing interface, JChemPaint

(jchempaint.sourceforge.net), which is included in gVTS (Figure 7).

o All submitted MOIs are atom-typed and energy minimized with MacroModel.11

 Initiate VTS runs and analyze results

o VTS jobs can be set up with any number of MOIs against either the entire library

of proteins or a custom created subset.

o Job runtime estimation algorithm predicts an approximate execution time based

on the number of MOIs, rotatable bonds per MOI, and number of screened

proteins.

36

 Job scheduling and queuing system provided by the CHARMMing interface allows for

submission of multiple jobs and can be interfaced to popular queuing systems such as

Torque, PBS, and sun grid engine (with slight modifications).

 Complete, up to the second, information on any job currently running or run in the past is

available and includes but is not limited to the status of the job, information about any

resulting hits, structures being screened/hit, log, and output files. In addition, the user is

able to visualize the docking pose of any MOI in a protein hit (Figure 8).

 User authentication system

o To ensure privacy of the data, the Django/CHARMMing based user

authentication system in combination with database identifiers protects each

user’s information such as MOIs, protein structures, jobs, etc. and allow access

only by authorized persons.

The interface has been developed using the underlying CHARMMing infrastructure at the time of the implementation using Python 2.6 programming language and the Django 0.96 object framework. A MySQL 5.1.37 database is used to maintain system information and user generated data. Perl scripts provide the interface to the Schrodinger12 software suite.

Figure 7. Example use of the JChemPaint applet embedded into the gVTS. This particular example illustrates the user- drawn structure of Staurosporine.

37

Figure 8. Illustration of VTS screen results page. The results of the screening of Staurosporine against a set of kinases.

Application of VTS to Identify Protein Targets of Natural Products in Drug Discovery

Natural Products and VTS

Natural products have historically been of interest to the drug discovery community.

However, despite the availability of a wide range of drug discovery techniques and the growing number of synthetic compound libraries the number of newly discovered drugs has dropped in recent years. This trend has largely been the impetus for the renewed interest in natural products in drug discovery13–19. Moreover recent advances in extraction, purification and structure determination techniques along with improved screening methods such as miniaturized high-

38

throughput phenotypic assays have made it less challenging for natural products to be interrogated with respect to their potential as therapeutic.20–22

The initiative to screen a collection of natural products described in this work consisted of two separate subtasks. Both tasks essentially amounted to a combination of conventional virtual screening and virtual target screening in a sense that the collections of natural product compounds were screened against a collection of proteins in the VTS database. One task involved screening of a publically available set of natural products from the Developmental

Therapeutics Program (DTP) of the National Cancer Institute (NCI). Another task was screening of non-public set of natural products from the Center for Drug Discovery (CDD) and Innovation at the University of South Florida (USF) in Tampa. The NCI set contains publicly accessible compounds and it includes molecules which have been mentioned in various published studies and have experimental data available to public. This means that in principle the results of VTS screen can be compared and possibly correlated with the available experimental data reported elsewhere. In contrast to the NCI set, CDDI set consists largely of molecules with little or no publicly available data. This means that it may be impossible to correlate the results of the VTS screen with any experimental data. However even without experimental consensus VTS can be used to gauge the potential of a natural product as to its likelihood of binding to target proteins.

Materials and Methods

Virtual Compound Libraries

The NCI Diversity Set I, consisting of 1,990 virtual compounds, was obtained from the

National Cancer Institute’s (NCI) Developmental Therapeutics Program (DTP) and used as described previously for our VTS system5,23. 39

The NCI natural Product Set II includes 120 compounds2. Due to cost and time considerations, we selected 67 compounds that ranged from 314 to 692 Daltons in a loose application of “Lipinski’s rule of 5”. These compounds were screened in VTS. The NCI DTP website also provides links to published experimental data on the compounds in their virtual libraries24. This allows for comparison and validation of VTS “hits” against known experimentally-determined interactions of the compounds when the data is available.

The CDDI library provided to us contained virtual structures for 160 natural product compounds, many originating from marine sources25. These are from the larger collections at the

CDDI, which include over 2,500 extracts and 950 characterized bioactive compounds. Again, due to cost and time consideration, we selected 87 of these compounds for VTS testing with a range of 350–675 Daltons.

Ligand Preparation

Prior to screening all compounds were prepared using the LigPrep utility in Schrodinger's

Maestro software.26,27 LigPrep generates 3D structures of all tautomers, ionic states and stereoisomers of input compounds. In the case of NCI compounds the structure files have indicated chiralities of some stereo centers. Therefore, when preparing the ligands, only the unspecified centers were allowed to sample different chiralities. For CDDI compounds however, to reduce computational time and the amount of isomers, all the chiralities were inferred from the

3D structure. As a result, LigPrep generated a total of 1133 structures from the 67 original NCI compounds and, because no chiralities were varied in the case of CDDI set, a total of only 103 structures resulted from LigPrep for the 87 original CDDI compounds.

40

Virtual Target Screening

We selected primarily human proteins as the targets for this VTS exercise, which yielded

1,059 target grids representing 1,011 unique crystal structures. For this exercise with a large number of MOIs to process in VTS, we estimated there would be over one million docking and scoring calculations. Therefore, the compounds were divided into 25 separate files, each containing subsets of 30–60 structures. This was done to effectively parallelize the VTS calculations. Each of these subsets was then uploaded through the VTS web interface’s MOI upload section. A screening job was then launched from the VTS web interface for each of the

25 subsets to be screened against all 1,059 target grids. On average each job took approximately

30 hours to complete. When analyzing the results of the screening jobs, a compound-protein was deemed to be a “hit” if the score outperformed the average score for the top 20 NCI diversity set compounds for that protein structure. The “hits” were compared against the literature for consistency with published experimental data. The PubChem database was used to look-up experimental activities for the compounds that hit one or more targets.28,29

Hardware

VTS was originally developed and run on a Dell Precision 490 workstation running

Fedora 8 Linux with dual Xeon 3.06 GHz processors, 4 GB RAM and a 250 GB hard drive. This was sufficient when running one or a few MOIs but, for the larger sets of MOIs for the 67 NCI natural products and the 87 CDDI natural products, we incorporated execution on a local cluster we have developed in the Virtual Screening & Molecular Modeling Core at H. Lee Moffitt

Cancer Center & Research Institute. This cluster is comprised of 6 racks each containing 8 Intel- 41

based dual core desktop computers connected via a NETGEAR Pro Sage 16 router (model

GS716T).

Graphics and Modeling

Visualization of docked MOIs in protein targets was performed with Schrödinger’s

Maestro, a graphical user interface that allows modeling and facilitates use of many of

Schrödinger’s applications with the models, such as MacroModel and Glide5,11,30. Refined depictions used for publication figures were developed using Schrödinger’s PyMol.31

Results

VTS on NCI Natural Products

Upon compilation of the results of all the jobs, a total of 13,278 hits were identified for the NCI Natural Products Set II compounds tested. For 4 of the screened NCI molecules VTS identified a total of 16 targets against which screened compounds were found active in one or more experimental assays (Table 2). The most “active” MOI Daunorubicin (Figure 9) hit a total of 6 unique proteins. The most interesting of those hits can be considered the Lck tyrosine kinase protein. Daunorubicin has hit 5 different structures that represent this protein. Moreover in at least two reported assays Daunorubicin had indeed shown activity against this particular target. It may be speculated that a compound hitting multiple structures of the same protein may indicate likely activity against that protein. In addition, results of the NCI natural products screen show preference of some compounds towards one or more protein families as can be seen in the case

42

of Wortmannin and a kinase family (Table 2). It should be noted however that kinases are disproportionally well represented (25%) in our protein database due to their importance in cancer. Still an insight like this can be used to gauge the potential for more specific targeting as well as possible promiscuity of an MOI.

Table 2. Top NCI natural products screen hits. Compounds with known activity against the targets that have been identified as hits in the VTS screen. Column 1 contains the NSC number of the compound, column 2 contains the common name of the compound, column 3 contains 2D sketch of the compound, column 4 contains the name of the protein targeted by the compound, column 5 shows the number of unique structures representing the target identified as a hit for the given MOI, column 6 is the number of assays that have reported activity of the MOI against this target.

NSC Compound Structure Target # of unique # of assays Name structures reporting identified by activity VTS* 82151 Daunorubicin Lck tyrosine kinase 5(11) 2

Estrogen Receptor 2(6) 8 Alpha

Epidermal Growth Factor Receptor 3(5) 1

Thyroid hormone 1(6) 4 receptor

Mitogen Activated 1(39) 1 Protein Kinase 14

Fyn tyrosine kinase 1(3) 1

43

Table 2 (Continued)

5159 Chartreusin Urokinase-type plasminogen 2(35) 1 activator** Heat Shock Protein HSP 90-alpha*** 1(13) 1

Cyclin-Dependent Kinase 5 1(2) 1

221019 Wortmannin Cyclin-Dependent Kinase 2 6(105) 1

Serine/threonine kinase PIM1 2(17) 1

Src tyrosine kinase 2(36) 1

Lck tyrosine kinase 1(11) 2 Cyclin-Dependent Kinase 5 1(2) 1

3-phosphoinositide-dependent protein 1(9) 1 kinase 1

345647 Chaetochromin Glutathione-S-transferase Omega 1 1(3) 2

* - Number in brackets denotes the total number of structures in VTS database.

** - VTS screening included and hit the human version of the protein while the experimental data is on mouse.

*** - VTS screening included and hit the human version of the protein while the experimental data is on a structurally similar Plasmodium falciparum.

44

Figure 9. Daunorubicin docked into Lck. Daunorubicin’s docked pose according to the VTS docking procedure. Inset shows the local hydrogen bond interactions in yellow dashed lines as well as the residues with which these interactions are formed in stick representation. The Lck structure was 1QPJ.pdb.32

VTS on CDDI Natural Products

The screening of CDDI compounds against our database has also yielded interesting results. Although assay information is not as readily available for these compounds, the results of the VTS can still show interesting trends and insights valuable for drug discovery. Table 3 shows the top 40 hits where the MOI has outperformed the top performers of the calibration set by the largest margin, thus making these compounds more likely to be tighter binders to the indicated targets. It can be seen that there are a number of molecules, as identified by their unique 45

molecular weight that hit several targets, as is especially the case with compounds 374, 587 and

629. Once again, this may be an early indicator of potentially promiscuous compounds. At the other end of the spectrum are compounds such as 287, 368 (Figure 10), 437, 530 and 674. These compounds only hit one target. Especially interesting are those compounds with lower molecular weight as it is easier for a larger compound to score high based on the increased number of favorable interactions it can form with the protein. If a smaller sized yet still a drug-like compound such as 368 significantly outperforms a calibration set for a single particular target, it may be worthy of a further inquiry as a potential binder to that protein.

Table 3. Top CDDI set hits. Top 40 hits based on the largest margin by which an MOI outperformed the top 20 calibration set average. First column contains the name of the target protein with co-crystallized ligand identifier in square brackets. Second column indicates the percentage by which a given MOI outscored the average docking score of the top 20 calibration set molecules for that target. Third column contains the rounded molecular weight of a docked MOI and serves as its identifier. The table is sorted by the molecular weight. Outperformed Molecular Weight Target calibration by (%) (compound ID)

[R11]Coagulation factor x 28 287

[L1G]HCK 28 368

[NHB]Histone deacetylase 8 30 372

[NHB]Histone deacetylase 8 24 372

[NHB]Histone deacetylase 8 36 374

[AO5]Methionine aminopeptidase 2 35 374

[GIP]Protein (lactoylglutathione lyase) 35 374

[NHB]Histone deacetylase 8 29 374

[GIP]Protein (lactoylglutathione lyase) 30 374 [AO2]Methionine aminopeptidase 2 23 374

46

Table 3 (Continued)

[NHB]Histone deacetylase 8 26 437

[NHB]Histone deacetylase 8 23 530

[HYF]Orphan nuclear receptor pxr 35 587

[HYF]Orphan nuclear receptor pxr 33 587

[QPP]CAMK1D 32 587

[SWF]CYP2C9 26 587

[DEO]Macrophage metalloelastase 20 587

[XLD]Coagulation factor x heavy chain 30 587

[RAP]Fkbp25 36 587

[SDK]Cathepsin K 30 587

[POS]Cathepsin K 30 587

[INA]Cathepsin K 29 587

[POS]Cathepsin K 28 587

[BOG]p38-alpha(MAPK14) 25 587

[2CA]Cathepsin k 27 587

[FMM]Epidermal growth factor receptor 23 625

[L1G]HCK 39 629

[471]Peroxisome proliferator activated 42 629 receptor

[U66]Protein farnesyltransferase alpha 31 629 subunit

[CIU]Epoxide hydrolase 2- cytoplasmic 29 629

[POS]Cathepsin K 33 629

47

Table 3 (Continued)

[SDK]Cathepsin K 31 629

[HYF]Orphan nuclear receptor pxr 25 629

[155]Urokinase-type plasminogen activator 27 629

[SAH]Histamine MeXferase (Ile105Var) 22 629

[AIJ]Estrogen receptor 21 629

[BNE]Protein farnesyltransferase alpha 21 629 subunit

[BOG]p38-alpha(MAPK14) 23 629

[LA1]Integrin alpha-L 31 663

[LA1]Integrin alpha-L 19 674

Discussion

We have used 67 compounds from the NCI Natural Products Set II to test our VTS system for its ability to virtually identify potential protein targets, i.e. “hits” that can be compared to published data for those compounds. We have also used 87 compounds from the CDDI natural products to assess their potential interactions with protein targets when they were screened through VTS. The work has also brought to light improvements that can be made to the current VTS system to make it more applicable to our new areas of application in drug discovery and development: infectious diseases and natural products, particularly marine natural products. There is an unmet need for large-scale drug discovery efforts with marine natural products.34,35 There is also opportunity for these large-scale

48

Figure 10. 368 docked into HCK. 368’s docked pose according to the VTS docking procedure. Inset shows the local hydrogen bond interactions in yellow dashed lines as well as the residues with which these interactions are formed in stick representation. The HCK structure was 2COI.pdb.33

efforts with marine natural products and natural products in general to be used more in academic settings, such as the CDDI.36 Drug discovery efforts with marine natural products have been successful with 13 products in clinical trials in 2010 and so these efforts should be expanded to take advantage of new natural products from new marine sources.37 In order to scale-up these efforts and make them more productive, particularly as natural product libraries grow, innovation is needed.38 We believe that VTS is an innovative, cost-effective approach to address the needs for rapid identification and appraisal of natural products. Besides our previous paper23 in which we compared our VTS results against experimental data focusing on kinases, there are other 49

recent reports in which other groups have compared results from computational drug repurposing applications against experimental data demonstrating the usefulness of the tools.39,40

This exercise demonstrates that VTS can identify potential promiscuity of natural product compounds that may deter further development of an MOI if the promiscuity is experimentally confirmed for some of the “hit” proteins. VTS can also give ideas of other targets that may not have been considered previously for the MOI. However, since our VTS protein collection is limited, we cannot yet consider the VTS runs as comprehensive searches. There may be more proteins that would arise as “hits” for these compounds if those proteins had been included.

Other systems are now available online for screening compounds against collections of protein structures, such as HitPick,41 ChemMapper,42 and Mantra.43 A comparison of VTS and other target finding applications was beyond the scope of this study since our purpose was to analyze

VTS with regards to natural products and infectious disease targets. However, we did test two of the top scoring compounds (wortmannin and daunorubicin) from the NCI Natural Products II and got similar hits of kinase proteins as were found by HitPick. A thorough study comparing the various target finding applications is certainly warranted now that more applications are published but it should be comprehensive with regards to ligands (small synthetic molecules and natural products). We believe the VTS approach has benefits in that it can be used towards different sites on a protein, such as allosteric sites, for which there may not be any previously known ligands. We can use Schrödinger’s SiteMap as a means of identifying potential binding sites [54,55].44,45 The drug-like molecules of the NCI Diversity set can quickly calibrate new sites even when there is no previous data. Many other applications are dependent on previously known ligands for specific binding sites in order to make their assessment of an MOI’s interactions.

50

VTS also can be used to compare an MOI against known inhibitors of VTS “hit” proteins and to compare the MOI against its own analogs to test focusing by functional groups as rational design is applied to the MOI to generate improved analogs. Some of the analogs may show more specificity to the intended target and generate fewer VTS “hits”, suggesting that the analog is an improvement over the original MOI. With regards to a known inhibitor, if the MOI shows more specificity (fewer VTS “hits”) than the inhibitor, this can increase interest for the MOI.

The main purpose of VTS is to identify new targets for an MOI, whether they are new targets for repurposing or targets that may indicate possible adverse interactions. For VTS to be effective towards this purpose, we need to increase our collection of proteins. Since VTS was originally developed for supporting oncological drug discovery projects, the emphasis was for inclusion of primarily human proteins. For testing compounds in VTS for infectious disease projects, we need a great increase in protein structures from viruses, bacteria and other microorganism that are relevant to those diseases. We need to grow the VTS protein collection as much as possible using as many structures from the Protein Data Bank, but we still want to focus on those structures that have high resolution (~2.0 Å or less) and primarily wild-type proteins.

We realize now that, even towards oncological drug discovery, proteins from microorganisms are important in VTS since often the beneficial microbes in the patient’s gut microbiota can be adversely affected by drug therapy. And of course there need to be more proteins represented for the problematic microorganisms in infectious diseases.

We intend also to include more proteins involved in protein-protein interactions. The target sites may be more difficult to isolate for Glide grids and so multiple copies of the protein structure with grids at different sites could be used. We can use Schrödinger’s SiteMap utility to identify potential binding sites as part of the protein preparation and center the Glide grids on

51

those sites.44 Disruptors of protein-protein interactions can conceivably be larger than typical small molecule substrates of enzyme active sites and so we would want to expand VTS to deal with larger MOIs. This would require an additional approach to calibrating the prepared proteins.

We can envision using a designed virtual di-, tri- or tetra-peptide or peptidomimetic library for an additional set of averages to use in calibration and VTS analysis. These averages of larger molecules would be of use for larger MOIs, which can occur with natural products. Another possible improvement is to incorporate into VTS an assessment of ligand binding efficiencies.

We can improve overall interpretation of VTS dockings by taking into account the number of non-hydrogen atoms to avoid inflation of the scores for larger ligands. The basic approach would be to divide the GScore by the number of non-hydrogen atoms. In addition, approaches being developed to assess drug-likeness need to be incorporated, such as the QED approach.46 Another issue to address in improving VTS is the quality of the prepared proteins. MacroModel has been used in VS and VTS for achieving a relaxed state, more in vivo-like, for the protein compared to the original crystal structure. However, we might think of this as local relaxation of protein domains, whereas a thorough molecular dynamics (MD) relaxation of the protein, in which it is put in a virtual environment that simulates the individual water molecules and ions, can relax the protein to a global energy minimum in solution that may be even more pertinent to creating the protein in a realistic state for screening. In the last few years MD applications, such as

Schrödinger’s Desmond, have become available so that we can now implement MD into our protein preparation steps.47 This implies that we should reprocess our existing protein collection with MD as well as apply MD to new proteins. MD requires more powerfully processing power and memory, which we can now access with clusters making these improvements feasible.

52

VTS and VS are developed primarily towards screening for non-covalent interactions of ligands and proteins. High-throughput virtual screening of potential covalent inhibitors has not been developed to our knowledge. This would be an interesting addition but does present many difficulties, even to develop a limited application. It would require a list of potential reactant groups, identifying key reactive residues in a protein that could be in play, and an algorithm to compare distances, charges and intermediates. It would seem to preclude the actions of cofactors and other agents. So covalent screening, at least with current tools, would be too challenging.

However, it is interesting to contemplate such tools and their potential benefits in drug discovery.

Conclusion

This work has shown the potential for VTS with natural products to identify targets that may represent new purposes for the MOI, possible promiscuity of an MOI, or possible adverse interactions that warrant further investigation. Our VTS system can serve these purposes but we have used this current work to identify issues to improve in order to maximize the use and benefits of VTS for infectious disease drug discovery projects, particularly with newly acquired marine natural products.

Using molecular modeling, chemo- and bioinformatics to search for biomolecular targets of

vitamin E δ-tocotrienol

Introduction

Vitamins, a subclass of natural products, have long been thought to play an important role 53

in physiology of living organisms. Many essential processes in the human body rely on the availability of vitamins. At the same time, while few dispute the importance of vitamins, in many cases their precise mode of action that manifests into health benefits remains elusive. This is not surprising however, as understanding the underpinnings of a biological effect of a small molecule requires the knowledge of its biomolecular target. Zeroing in on the precise mode of action of many vitamins is made more complicated by the fact that many come in various forms and induce a range of physiological responses. Vitamin E (VE) is good example of such complexity. VE is an essential lipid-soluble vitamin and an important macronutrient that has long been thought to have strong antioxidant effects without causing major toxicity in humans.48–50

Its primary activity has been attributed to its ability to reduce free radicals to prevent lipid peroxidation of polyunsaturated fatty acids.51 VE consist of 8 naturally occurring isomers, d- alpha-, d-beta-, d-gamma-, and d-delta-tocopherols and d-alpha-, d-beta-, d-gamma-, and d-delta- tocotrienols.52 Among the constituents of VE, tocopherols have initially received a great deal of attention from scientific community due to their potent antioxidant properties and their abundance in common food sources. Recently however the focus has been shifting towards tocotrienols which, although rare in nature, have exhibited a number of physiological effects that include neuroprotective, antioxidant, anticancer, cholesterol lowering and other therapeutic activities.53–57 The diversity and, at the same time, specificity of physiological responses to tocotrienols may mean that their modes of action differ from the broad antioxidant activity of tocopherols. A quick look at the structures of tocopherols and tocotrienols reveal three double bonds in the farnesyl isoprenoid tail of the tocotrienols making it a less flexible of the two classes of compounds (Figure 11). This additional restraint on the conformational freedom of tocotrienols however may be the reason for the more specific range of activities. The

54

significantly reduced number of allowable conformations of tocotrienols over tocopherols may result in the increased affinity of tocotrienols towards a specific group of binding partners. If this is true, tocotrienols' activity goes beyond that of a broad antioxidant and into the signaling realm.

Studies of Vitamin E delta-tocotrienol (VEDT) have already proposed the notion that at least

VEDT's anti-cancer activity may be attributed not to its antioxidant properties but rather to its involvement in the signaling pathways of cancer.58–60 Furthermore VEDT was proposed to act as a mediating substance in antiproliferative and apoptotic mechanisms in carcinogenic tumor cells.52,56,58,61–64 Present study attempts to gain insight into possible modes of action of the

Vitamin E δ-tocotrienol (VEDT) by using molecular modeling, chemo- and bioinformatics to propose VEDT's likely binding partners.

To complement experimental efforts directed at pinpointing molecular target(s) of VEDT this study takes advantage of in silico approaches to interrogate the protein space for potential targets. As the in silico target discovery gained prominence as viable tool in the at various stages of research campaigns, a number of methodologies have been developed to take advantage of computational efficiency and storage capabilities of computer systems to search for potential protein targets.65–71

In this study a number of different approaches were used to computationally probe for potential target proteins of VEDT. Moreover, the methods used in this study vary significantly in their approach yet aim to solve the same problem. This was done in part to decrease the effect of errors resulting from each individual tool. More importantly the results drawn from the consensus reached via the utilization of different approaches have shown to produce more reliable results.72–75 One method called Virtual Target Screening (VTS) is based on the explicit

55

δ-tocopherol

δ-tocotrienol

Figure 11. Comparison of the structures of δ-tocopherol and δ-tocotrienol. The additional double bonds of δ-tocotrienol make it more conformationally constrained which can lead to its increase preference towards particular type of targets.

modeling the intermolecular interactions between VEDT and its potential targets via a docking protocol designed specifically to search for targets of small drug- or natural product-like compounds.10,76 Other methods PharmMapper77,78 and PASS (Prediction of Activity Spectra of

Substances)79–81 take a chemoinformatics approach to proposing potential target of a molecule of interest. Here the structure of VEDT and its pharmacophore is compared to a database of structures and pharmacophores of small molecules with known biomolecular targets. The target proteins are then proposed based on the structural similarities between database compounds and

VEDT. Lastly binding site analysis tool ProBiS82–85 was used to mine the Protein Data Bank86,87 for proteins containing binding pockets similar to those that are likely to accommodate VEDT.

56

Methods

VEDT Preparation.

3D coordinates of VEDT and δ-tocopherol were downloaded from PubChem (citation) in an SDF (Structure Data File) format and were prepared using a LigPrep27 module of the suite of molecular modeling software Schrodinger26 with default settings corresponding to physiological pH.

Virtual Target Screening.

VTS, a web-based software deployed on a private computer network of the Moffitt

Cancer Center10,76 was used as a molecular modeling component of this study. VTS was used to identify potential binding partners of VEDT by the means of molecular docking. The MOIs

(VEDT and δ-tocopherol) were docked into each of the protein in the library. In addition to human protein structures which comprise the majority of the VTS library (~1000) structures from other organisms ranging from bacteria to mammals were also present and were included in

VEDT screen for completeness. Once docked, MOIs’ docking performance as measured by their docking scores were compared to those of the calibration set compounds for each protein. A protein was considered a hit if MOI’s docking score was better than the average docking score of the top 200 calibration compounds docked into the protein.

PharmMapper.

PharmMapper uses pharmacophore mapping to identify potential targets for the MOI. Six 57

pharmacophore features including hydrophobic center, positively charged center, negatively charged center, hydrogen bond acceptor and donor vectors and aromatic plane of an ensemble of the MOI conformations are mapped onto a library of pharmacophores extracted from publicly available crystal structures of protein-ligand complexes. A fit score between the MOI's pharmacophores and those in the library is then calculated and N best fitting targets are suggested. The SDF file of VEDT structure downloaded from PubChem was converted to Tripos mol2 file using OpenBabel software (citation). The resulting mol2 file was submitted to

PharmMapper server at http://59.78.96.61/pharmmapper.

PASS (Prediction of Activity Spectra of Substances).

PASS predicts biological activity of the MOI based on the Multilevel Neighborhoods of

Atoms (MNA)88 structural descriptors of compounds and a training set of structure-activity relationship (SAR) data for over 60,000 chemical substances (SAR Base). For each biological activity, based on the similarity of MNA descriptors of the MOI and the substances in the SAR

Base PASS outputs two probabilities, the probability of the MOI to exhibit the activity and the probability of the MOI to not exhibit the activity. To execute PASS prediction SMILES chemical identifier representing VEDT -

CC1=C2C(=CC(=C1)O)CCC(O2)(C)CCC=C(C)CCC=C(C)CCC=C(C)C was extracted from

PubChem entry for VEDT and submitted to http://www.pharmaexpert.ru/passonline/predict..

58

ProBiS (Protein Binding Sites Detection).

ProBiS identifies in the PDB (Protein Data Bank) proteins structurally similar to the user-supplied protein of interest (POI). The algorithm behind ProBiS server represents the entire proteins as graphs where vertices correspond to functional groups of surface amino acids and edges represent distance between these functional groups. Pairwise alignment of structural features of the POI and the proteins in PDB is then performed and results are displayed sorted by the statistical Z-scores, with the proteins most similar to POI displayed on top. PDB code for

Estrogen Receptor β (ERβ) 1NDE89 and was used as the input for the ProBiS search. The choice of the above PDB code was based on the results of the VTS screen.

Results

The VTS screen of both VEDT and δ-tocopherol have resulted in a number of proteins that were hits for VEDT but not for δ-tocopherol (Table 4).

Table 4. VEDT and δ-tocopherol VTS results. Hit - proteins into which an MOI docked with a score that outperformed the top 200 molecules of the calibration set. Hit* - identifies top hits; proteins into which an MOI docked with a score that outperformed the top 20 molecules of the calibration set.

δ- VEDT tocopherol [Native ligand] Protein PDB Code Result Result [Mon]Estrogen receptor beta 1NDE Hit* Not a Hit [ZEB]Cytidine Deaminase 1CTU Hit* Hit [MOF]Progesterone receptor 1SR7 Hit* Hit [Proflavin]Alpha-Thrombin 1BCU Hit Not a Hit

59

Table 4 (Continued)

[CL3]Cell division protein zipA 1Y2G Hit Not a Hit [D91]Coagulation factor x- heavy chain 1WU1 Hit Not a Hit [SRL]Orphan nuclear receptor pxr 1ILH Hit Not a Hit [_U]Cytidine Deaminase 1AF2 Hit Hit [TTB]Retinoic acid receptor beta 1XAP Hit Not a Hit [442]Thyroid hormone receptor beta-1 1R6G Hit Not a Hit [L79]Retinoic acid receptor rxr-alpha 1RDT Hit Not a Hit [STR]Igg1-kappa db3 fab (light chain) 1DBB Hit Not a Hit [E1P]Peptidyl-prolyl cis-trans isomerase a 1W8M Hit Not a Hit [AO5]Methionine aminopeptidase 2 1R58 Hit Not a Hit [SDK]Cathepsin K 1AU0 Hit Not a Hit [FSN]Thrombin light chain 1OYT Hit Not a Hit [R11]Coagulation factor x 1G2M Hit Not a Hit [HEM]albumin (no iron) 1N5U Hit Not a Hit [ANO]Igg1-kappa db3 fab (light chain) 1DBK Hit Not a Hit [L10]Mitogen-activated protein kinase 14 1W82 Hit Not a Hit [AB8]Integrin alpha-I 1XDG Hit Not a Hit [FPP]Protein farnesyltransferase 1FPP Hit Not a Hit [IGN]Prothrombin 1K21 Hit Not a Hit [DHZ]Cytidine Deaminase 1CTT Hit Not a Hit [MSC]HIV-1 Protease 1D4J Hit Not a Hit [L08]Integrin alpha-I 1RD4 Hit Not a Hit {CI1031(Z34)]Coagulation Factor XA 1FJS Hit Not a Hit [BLN]Cathepsin s 1MS6 Hit Not a Hit [AIH]Estrogen receptor 1XP1 Hit Not a Hit [CMB]Blood coagulation factor xa 1LPZ Hit Not a Hit [RTR]Coagulation factor xa- heavy chain 1NFY Hit Not a Hit [EST]Estradiol receptor 1QKT Hit Not a Hit [STU]Tyrosine-Protein Kinase zap-70 1U59 Hit Not a Hit [Palmitate]Lipocalin Beta-Lactoglobulin 1B0O Hit Hit [AIU]Estrogen receptor 1XP6 Hit Not a Hit

60

Table 4 (Continued)

[AAY]Integrin alpha-I 1XDD Hit Not a Hit [HYC]Type117 beta-hydroxysteroid dehydrogenase 1I5R Hit Not a Hit [CIU]Epoxide Hydrolase 1EK1 Hit Not a Hit [165]Prothrombin 1SB1 Hit Hit [CIU]Epoxide hydrolase 2- cytoplasmic 1VJ5 Hit Hit [HYF]Orphan nuclear receptor pxr 1M13 Hit Hit [337]map kinase 14 3CTQ Hit Hit

According to VTS, ERβ (PDB code 1NDE) was a top hit for VEDT and while not being a hit for δ-tocopherol (Figure 12). 33 other proteins were also predicted as potential hits for

VEDT but not for δ-tocopherol. This is consistent with some experimental findings that support binding of VEDT to ERβ.55,60 Moreover some patterns or at least consistencies may be observed by looking at the VTS hit list. In particular, a number of hormone and nuclear receptors can be seen in the proposed list of targets. This is not surprise however, as ProBis search for binding sites similar to the ERβ reveals that the binding sites of a number of proteins such as Rxr-like protein, retinoic acid receptor rxr-α, progesterone receptor and others are indeed quite similar

(Table 5).

PharmMapper prediction resulted in 300 proposed pharmacophores with those fitting the flexible alignment with VEDT ranked on top. These results were examined for consistency with the best VEDT hits from VTS screen. Ranked 51 was ERβ based on the pharmacophore derived from the PDB structure 1NDE, same structure that represented the best hit during VTS screen.

PASS prediction resulted in proposed 500 activities which VEDT statistically is more

61

Figure 12. Docked pose of VEDT as identified by the VTS. VEDT, colored in magenta forms hydrogen bonds (yellow dashed lines) with GLU 305 and ARG 346.

Table 5. Results of the ProBis query based on the PDB ID 1NDE input structure. The list reveals structures in the Protein Data Bank that contain binding site similar to that of ERβ as represented by the structure under the PDB ID 1NDE.

PDB ID Protein Name 2j7y ESTROGEN RECEPTOR BETA 3uud ESTROGEN RECEPTOR 3ltx ESTROGEN RECEPTOR 2e2r ESTROGEN-RELATED RECEPTOR GAMMA 3mnp GLUCOCORTICOID RECEPTOR 2q1h ANCCR

62

Table 5 (Continued)

1xiu RXR-LIKE PROTEIN 4fne STEROID RECEPTOR 2 3k6p STEROID HORMONE RECEPTOR ERR1 1g2n ULTRASPIRACLE PROTEIN 4e2j ANCESTRAL GLUCOCORTICOID RECEPTOR 2 3vhv MINERALOCORTICOID RECEPTOR 1t7r ANDROGEN RECEPTOR 2p1t RETINOIC ACID RECEPTOR RXR-ALPHA 3ry9 ANCESTRAL GLUCOCORTICOID RECEPTOR 1 1z5x ULTRASPIRACLE PROTEIN (USP) A HOMOLOGUE OF RXR 3dzu RETINOIC ACID RECEPTOR RXR-ALPHA 1sr7 PROGESTERONE RECEPTOR 3plz FTZ-F1 RELATED PROTEIN 1hg4 ULTRASPIRACLE 1lbd RETINOID X RECEPTOR 3eyb NUCLEAR HORMONE RECEPTOR RXR 4iqr HEPATOCYTE NUCLEAR FACTOR 4-ALPHA 2gl8 RETINOIC ACID RECEPTOR RXR-GAMMA 2q60 RETINOID X RECEPTOR 4j5x RETINOIC ACID RECEPTOR RXR-ALPHA, NUCLEAR RECEPTOR

1lv2 COACTIVATORHEPATOCYTE NUCLEAR 1 FACTOR 4-GAMMA 3dzu PEROXISOME PROLIFERATOR-ACTIVATED RECEPTOR GAMMA 3kmr RETINOIC ACID RECEPTOR ALPHA

likely to exhibit than not. The results were ranked by the probability that VEDT exhibits a given activity. Similarly these results were examined for consistency with the VTS screen and

PharmMapper prediction. 63

According to the PASS prediction there are several activities related either to estrogen modulation or the ERβ specifically (Table 6).

Table 6. PASS prediction results. Predicted activities with relation to the estrogen modulation or action of ERβ. Rank – overall PASS rank of this activity by VEDT. Pa – Probability that VEDT exhibits the given activity. Pi – Probability that VEDT does not exhibit the given activity.

Rank Pa Pi Activity

213 0,177 0,022 Estrogen antagonist

234 0,148 0,024 Estrogen receptor beta

247 0,140 0,026 antagonistEstrogen agonist 371 0,048 0,021 Estrogen beta receptor

agonist According to PASS, among the estrogen related activities VEDT is likely to serve as estrogen antagonist with highest probability, followed by ERβ antagonist, followed by estrogen agonist, followed by ERβ agonist. Based on this result, VEDT is more likely to serve as an antagonist of estrogen and ERβ activity rather than their agonist.

Discussion

Consensus between Computational Studies

Identification the potential molecular targets of VEDT has been attempted via a combination of molecular modeling (docking), chemoinformatics (SAR) techniques and bioinformatics in the form of binding site analysis, The results show consistency in identifying

ERβ as a potential target of VEDT. All three approaches VTS, PharmMapper and PASS have identified ERβ modulation as one of the potential activities of VEDT. In particular VTS has 64

identified ERβ as represented by PDB ID 1NDE as a top target from a list of 1451 protein structures. Although PharmMapper and PASS algorithms operate on a substantially larger data space, still they were able to identify ERβ as a potential target, although didn’t rank it as high as

VTS. Moreover PharmMapper exhibited consistency in preference towards the conformation of the ERβ represented by 1NDE, hinting at possible antagonistic mechanism of action.60,89 The hypothesis of the antagonistic mode of action can also be favored based on the results of PASS which favored an antagonistic activity of VEDT towards ERβ than that of an agonistic.

Consistency with Experimental Data

Although no experimental data was considered during the initial VTS screening of

VEDT, subsequent literature search revealed that Comitato and colleagues had performed both the in vitro binding analyses and molecular docking studies to identify a high affinity interaction between VEDT and ERβ, a first evidence for such an association reported59,60. The consistency of experimental data with the computational approaches reported in this work is encouraging and gives validity to the use of tools such as VTS to probe for potential molecular targets of compounds with an unknown mode of action. In addition the use of the techniques described in this work allows for additional inferences regarding the precise modulating effect of the MOI on its target. Although Comitato and co-workers, based on the docking studies, favored an agonistic mode of action and seem to contradict the preference towards an antagonistic activity favored by our results, the combination of different approaches reported here can be beneficial in facilitating the investigation of the true mode of action of the MOI.

65

Overall, the use of VTS, particularly in combination with other methods aimed at achieving similar goals via different approaches, is a viable strategy in helping identify potential targets of natural products and other chemical substances. Moreover utilization of bioinformatics approaches like ProBis can help gain additional insight into potential targets and help identify structures for further inclusion into the VTS and chemoinformatics screens. This multi-pronged consensus approach may prove especially valuable in cases where it is not feasible or otherwise prohibitive to conduct experimental studies to address this question.

List of References

(1) Ewing, T. J. A.; Makino, S.; Skillman, A. G.; Kuntz, I. D. DOCK 4.0: Search strategies for automated molecular docking of flexible molecule databases. J. Comput. Aided. Mol. Des. 15, 411–428.

(2) NCI Diversity Set was provided by the Developmental Therapeutics Program, D. of; Cancer Treatment and Diagnosis, National Cancer Institute, 6130 Executive Blvd., R. 8020; Rockville, M. 20852. D. of N. collections in the D. diversity collections are available at:; Http://dtp.nci.nih.gov/branches/dscb/div2_explanation.html. No Title. .

(3) Sotriffer, Christoph., Manhold, R. Applications and success stories in virtual screening. In Virtual Screening: Principles, Challenges, and Practical Guidelines; Wiley-VCH Verlag GmbH & Co, 2011; pp 319–358.

(4) Vangrevelinghe, E.; Zimmermann, K.; Schoepfer, J.; Portmann, R.; Fabbro, D.; Furet, P. Discovery of a Potent and Selective Protein Kinase CK2 Inhibitor by High-Throughput Docking. J. Med. Chem. 2003, 46, 2656–2662.

(5) Friesner, R. A.; Banks, J. L.; Murphy, R. B.; Halgren, T. A.; Klicic, J. J.; Mainz, D. T.; Repasky, M. P.; Knoll, E. H.; Shelley, M.; Perry, J. K.; Shaw, D. E.; Francis, P.; Shenkin, P. S. Glide: a new approach for rapid, accurate docking and scoring. 1. Method and assessment of docking accuracy. J. Med. Chem. 2004, 47, 1739–1749.

(6) Hui-fang, L.; Qing, S.; Jian, Z.; Wei, F. Evaluation of various inverse docking schemes in multiple targets identification. J. Mol. Graph. Model. 2010, 29, 326–330.

66

(7) Li, Y. Y.; An, J.; Jones, S. J. M. A large-scale computational approach to drug repositioning. Genome Inform. 2006, 17, 239–247.

(8) Do, Q.-T.; Lamy, C.; Renimel, I.; Sauvan, N.; André, P.; Himbert, F.; Morin-Allory, L.; Bernard, P. Reverse Pharmacognosy: Identifying Biological Properties for Plants by Means of their Molecule Constituents: Application to Meranzin. Planta Med. 2007, 73, 1235–1240.

(9) Grinter, S. Z.; Liang, Y.; Huang, S.-Y.; Hyder, S. M.; Zou, X. An inverse docking approach for identifying new potential anti-cancer targets. J. Mol. Graph. Model. 2011, 29, 795–799.

(10) Santiago, D. N.; Pevzner, Y.; Durand, A. A.; Tran, M.; Scheerer, R. R.; Daniel, K.; Sung, S.-S.; Woodcock, H. L.; Guida, W. C.; Brooks, W. H. Virtual target screening: validation using kinase inhibitors. J. Chem. Inf. Model. 2012, 52, 2192–2203.

(11) Schrödinger. MacroModel. MacroModel, 2012.

(12) Schrodinger, L. L. C. Maestro. Maestro, 2012.

(13) Newman, D. J.; Cragg, G. M. Natural products as sources of new drugs over the 30 years from 1981 to 2010. J. Nat. Prod. 2012, 75, 311–335.

(14) Mishra, B. B.; Tiwari, V. K. Natural products: an evolving role in future drug discovery. Eur. J. Med. Chem. 2011, 46, 4769–4807.

(15) Bachmann, B. O.; Van Lanen, S. G.; Baltz, R. H. Microbial genome mining for accelerated natural products discovery: is a renaissance in the making? J. Ind. Microbiol. Biotechnol. 2014, 41, 175–184.

(16) Williams, P.; Sorribas, A.; Howes, M.-J. R. Natural products as a source of Alzheimer’s drug leads. Nat. Prod. Rep. 2011, 28, 48–77.

(17) Kuete, V.; Alibert-Franco, S.; Eyong, K. O.; Ngameni, B.; Folefoc, G. N.; Nguemeving, J. R.; Tangmouo, J. G.; Fotso, G. W.; Komguem, J.; Ouahouo, B. M. W.; Bolla, J.-M.; Chevalier, J.; Ngadjui, B. T.; Nkengfack, A. E.; Pagès, J.-M. Antibacterial activity of some natural products against bacteria expressing a multidrug-resistant phenotype. Int. J. Antimicrob. Agents 2011, 37, 156–161.

(18) Cragg, G. M.; Newman, D. J. Natural products: a continuing source of novel drug leads. Biochim. Biophys. Acta 2013, 1830, 3670–3695.

(19) Dayan, F. E.; Owens, D. K.; Duke, S. O. Rationale for a natural products approach to herbicide discovery. Pest Manag. Sci. 2012, 68, 519–528.

67

(20) Swinney, D. C.; Anthony, J. How were new medicines discovered? Nat. Rev. Drug Discov. 2011, 10, 507–519.

(21) Bugni, T. S.; Harper, M. K.; McCulloch, M. W. B.; Reppart, J.; Ireland, C. M. Fractionated marine invertebrate extract libraries for drug discovery. Molecules 2008, 13, 1372–1383.

(22) Molinski, T. F. NMR of natural products at the “nanomole-scale.”Nat. Prod. Rep. 2010, 27, 321.

(23) Santiago, D. N.; Pevzner, Y.; Durand, A. A.; Tran, M.; Scheerer, R. R.; Daniel, K.; Sung, S.-S.; Woodcock, H. L.; Guida, W. C.; Brooks, W. H. Virtual target screening: validation using kinase inhibitors. J. Chem. Inf. Model. 2012, 52, 2192–2203.

(24) Online data searching is available through the NCI DTP website “data search”; Http://dtp.nci.nih.gov/docs/dtp_search.html. No Title. .

(25) The Florida Center of Excellence in Drug Discovery and Innovation (CDDI) is described; Http://www.research.usf.edu/cddi/drugdiscovery/resources.asp. No Title. .

(26) Schrödinger. Suite 2012: Maestro. Suite 2012: Maestro, 2012.

(27) Schrodinger. Suite 2012: LigPrep. Suite 2012: LigPrep, 2012.

(28) Wang, Y.; Xiao, J.; Suzek, T. O.; Zhang, J.; Wang, J.; Zhou, Z.; Han, L.; Karapetyan, K.; Dracheva, S.; Shoemaker, B. A.; Bolton, E.; Gindulyte, A.; Bryant, S. H. PubChem’s BioAssay Database. Nucleic Acids Res. 2012, 40, D400–12.

(29) Bolton, E.; Wang, Y.; Thiessen, P. A.; Bryant, S. H. PubChem: Integrated Platform of Small Molecules and Biological Activities. PubChem: Integrated Platform of Small Molecules and Biological Activities. Annual Reports in Computational Chemistry Volume 4, 2008.

(30) Halgren, T. A.; Murphy, R. B.; Friesner, R. A.; Beard, H. S.; Frye, L. L.; Pollard, W. T.; Banks, J. L. Glide: a new approach for rapid, accurate docking and scoring. 2. Enrichment factors in database screening. J. Med. Chem. 2004, 47, 1750–1759.

(31) The PyMOL Molecular Graphics System. The PyMOL Molecular Graphics System.

(32) Zhu, X.; Kim, J. L.; Newcomb, J. R.; Rose, P. E.; Stover, D. R.; Toledo, L. M.; Zhao, H.; Morgenstern, K. A. Structural analysis of the lymphocyte-specific kinase Lck in complex with non-selective and Src family selective kinase inhibitors. Structure 1999, 7, 651–661.

(33) Goto, M.; Miyahara, I.; Hirotsu, K.; Conway, M.; Yennawar, N.; Islam, M. M.; Hutson, S. M. Structural determinants for branched-chain aminotransferase isozyme-specific inhibition by the anticonvulsant drug gabapentin. J. Biol. Chem. 2005, 280, 37246–37256. 68

(34) Liu, X.; Ashforth, E.; Ren, B.; Song, F.; Dai, H.; Liu, M.; Wang, J.; Xie, Q.; Zhang, L. Bioprospecting microbial natural product libraries from the marine environment for drug discovery. J. Antibiot. (Tokyo). 2010, 63, 415–422.

(35) Imhoff, J. F.; Labes, A.; Wiese, J. Bio-mining the microbial treasures of the ocean: New natural products. Biotechnol. Adv. 2011, 29, 468–482.

(36) Bauer, R. A.; Wurst, J. M.; Tan, D. S. Expanding the range of “druggable” targets with natural product-based libraries: an academic perspective. Curr. Opin. Chem. Biol. 2010, 14, 308–314.

(37) Mayer, A. M. S.; Glaser, K. B.; Cuevas, C.; Jacobs, R. S.; Kem, W.; Little, R. D.; McIntosh, J. M.; Newman, D. J.; Potts, B. C.; Shuster, D. E. The odyssey of marine pharmaceuticals: a current pipeline perspective. Trends Pharmacol. Sci. 2010, 31, 255– 265.

(38) Bennani, Y. L. Drug discovery in the next decade: innovation needed ASAP. Drug Discov. Today 2011, 16, 779–792.

(39) Reker, D.; Rodrigues, T.; Schneider, P.; Schneider, G. Identifying the macromolecular targets of de novo-designed chemical entities through self-organizing map consensus. Proc. Natl. Acad. Sci. U. S. A. 2014, 111, 4067–4072.

(40) Keiser, M. J.; Setola, V.; Irwin, J. J.; Laggner, C.; Abbas, A. I.; Hufeisen, S. J.; Jensen, N. H.; Kuijer, M. B.; Matos, R. C.; Tran, T. B.; Whaley, R.; Glennon, R. A.; Hert, J.; Thomas, K. L. H.; Edwards, D. D.; Shoichet, B. K.; Roth, B. L. Predicting new molecular targets for known drugs. Nature 2009, 462, 175–181.

(41) Liu, X.; Vogt, I.; Haque, T.; Campillos, M. HitPick: a web server for hit identification and target prediction of chemical screenings. Bioinformatics 2013, 29, 1910–1912.

(42) Gong, J.; Cai, C.; Liu, X.; Ku, X.; Jiang, H.; Gao, D.; Li, H. ChemMapper: a versatile web server for exploring pharmacology and chemical structure association based on molecular 3D similarity method. Bioinformatics 2013, 29, 1827–1829.

(43) Iorio, F.; Isacchi, A.; di Bernardo, D.; Brunetti-Pierri, N. Identification of small molecules enhancing autophagic function from drug network analysis. Autophagy 2010, 6, 1204– 1205.

(44) SiteMap. SiteMap, 2012.

(45) Halgren, T. A. Identifying and characterizing binding sites and assessing druggability. J. Chem. Inf. Model. 2009, 49, 377–389.

(46) Bickerton, G. R.; Paolini, G. V; Besnard, J.; Muresan, S.; Hopkins, A. L. Quantifying the chemical beauty of drugs. Nat. Chem. 2012, 4, 90–98. 69

(47) Schrodinger Release 2014-1: Desmond Molecular Dynamics System. Schrodinger Release 2014-1: Desmond Molecular Dynamics System, 2014.

(48) Sen, C. K.; Khanna, S.; Roy, S. Tocotrienols: Vitamin E beyond tocopherols. Life Sci. 2006, 78, 2088–2098.

(49) Constantinou, C.; Papas, A.; Constantinou, A. I. Vitamin E and cancer: An insight into the anticancer activities of vitamin E isomers and analogs. Int. J. Cancer 2008, 123, 739–752.

(50) Radhakrishnan, A.; Tudawe, D.; Chakravarthi, S.; Chiew, G.; Haleagrahara, N. Effect of γ-tocotrienol in counteracting oxidative stress and joint damage in collagen-induced arthritis in rats. Exp. Ther. Med. 2014, 7, 1408–1414.

(51) Van Poppel, G.; van den Berg, H. Vitamins and cancer. Cancer Lett. 1997, 114, 195–202.

(52) Malafa, M. P.; Fokum, F. D.; Mowlavi, A.; Abusief, M.; King, M. Vitamin E inhibits melanoma growth in mice. Surgery 2002, 131, 85–91.

(53) Liu, J.; Lau, E. Y.-T.; Chen, J.; Yong, J.; Tang, K.; Lo, J.; Ng, I. O.-L.; Lee, T. K.-W.; Ling, M.-T. Polysaccharopeptide enhanced the anti-cancer effect of gamma-tocotrienol through activation of AMPK. BMC Complement. Altern. Med. 2014, 14, 303.

(54) Nasr, M.; Nafee, N.; Saad, H.; Kazem, A. Improved antitumor activity and reduced cardiotoxicity of epirubicin using hepatocyte-targeted nanoparticles combined with tocotrienols against hepatocellular carcinoma in mice. Eur. J. Pharm. Biopharm. 2014, 88, 216–225.

(55) Nakaso, K.; Tajima, N.; Horikoshi, Y.; Nakasone, M.; Hanaki, T.; Kamizaki, K.; Matsura, T. The estrogen receptor β-PI3K/Akt pathway mediates the cytoprotective effects of tocotrienol in a cellular Parkinson’s disease model. Biochim. Biophys. Acta 2014, 1842, 1303–1312.

(56) Ananthula, S.; Parajuli, P.; Behery, F. A.; Alayoubi, A. Y.; El Sayed, K. A.; Nazzal, S.; Sylvester, P. W. Oxazine derivatives of γ- and δ-tocotrienol display enhanced anticancer activity in vivo. Anticancer Res. 2014, 34, 2715–2726.

(57) Sylvester, P. W.; Shah, S. J. Mechanisms mediating the antiproliferative and apoptotic effects of vitamin E in mammary cancer cells. Front. Biosci. 2005, 10, 699–709.

(58) Ng, L.-T.; Lin, L.-T.; Chen, C.-L.; Chen, H.-W.; Wu, S.-J.; Lin, C.-C. Anti-melanogenic effects of δ-tocotrienol are associated with tyrosinase-related proteins and MAPK signaling pathway in B16 melanoma cells. Phytomedicine 2014, 21, 978–983.

(59) Comitato, R.; Leoni, G.; Canali, R.; Ambra, R.; Nesaretnam, K.; Virgili, F. Tocotrienols activity in MCF-7 breast cancer cells: involvement of ERbeta signal transduction. Mol. Nutr. Food Res. 2010, 54, 669–678. 70

(60) Comitato, R.; Nesaretnam, K.; Leoni, G.; Ambra, R.; Canali, R.; Bolli, A.; Marino, M.; Virgili, F. A novel mechanism of natural vitamin E tocotrienol activity: involvement of ER signal transduction. AJP Endocrinol. Metab. 2009, 297, E427–E437.

(61) Yamasaki, M.; Nishimura, M.; Sakakibara, Y.; Suiko, M.; Morishita, K.; Nishiyama, K. Delta-tocotrienol induces apoptotic cell death via depletion of intracellular squalene in ED40515 cells. Food Funct. 2014, 5, 2842–2849.

(62) Husain, K.; Centeno, B. A.; Chen, D.-T.; Hingorani, S. R.; Sebti, S. M.; Malafa, M. P. Vitamin E -Tocotrienol Prolongs Survival in the LSL-KrasG12D/+;LSL- Trp53R172H/+;Pdx-1-Cre (KPC) Transgenic Mouse Model of Pancreatic Cancer. Cancer Prev. Res. 2013, 6, 1074–1083.

(63) Hodul, P. J.; Dong, Y.; Husain, K.; Pimiento, J. M.; Chen, J.; Zhang, A.; Francois, R.; Pledger, W. J.; Coppola, D.; Sebti, S. M.; Chen, D.-T.; Malafa, M. P. Vitamin E δ- tocotrienol induces p27(Kip1)-dependent cell-cycle arrest in pancreatic cancer cells via an E2F-1-dependent mechanism. PLoS One 2013, 8, e52526.

(64) Husain, K.; Centeno, B. A.; Chen, D.-T.; Fulp, W. J.; Perez, M.; Zhang Lee, G.; Luetteke, N.; Hingorani, S. R.; Sebti, S. M.; Malafa, M. P. Prolonged survival and delayed progression of pancreatic intraepithelial neoplasia in LSL-KrasG12D/+;Pdx-1-Cre mice by vitamin E -tocotrienol. Carcinogenesis 2013, 34, 858–863.

(65) Vankayala, S. L.; Hargis, J. C.; Woodcock, H. L. How Does Catalase Release Nitric Oxide? A Computational Structure–Activity Relationship Study. J. Chem. Inf. Model. 2013, 53, 2951–2961.

(66) Hargis, J. C.; Vankayala, S. L.; White, J. K.; Woodcock, H. L. Identification and Characterization of Noncovalent Interactions That Drive Binding and Specificity in DD- Peptidases and β-Lactamases. J. Chem. Theory Comput. 2014, 10, 855–864.

(67) Terstappen, G. C.; Reggiani, A. In silico research in drug discovery. Trends Pharmacol. Sci. 2001, 22, 23–26.

(68) Pauly, G. T.; Loktionova, N. A.; Fang, Q.; Vankayala, S. L.; Guida, W. C.; Pegg, A. E. Substitution of Aminomethyl at the Meta-Position Enhances the Inactivation of O 6 - Alkylguanine-DNA Alkyltransferase by O 6 -Benzylguanine. J. Med. Chem. 2008, 51, 7144–7153.

(69) Vankayala, S. L.; Hargis, J. C.; Woodcock, H. L. Unlocking the Binding and Reaction Mechanism of Hydroxyurea Substrates as Biological Nitric Oxide Donors. J. Chem. Inf. Model. 2012, 52, 1288–1297.

(70) Shoichet, B. K. Virtual screening of chemical libraries. Nature 2004, 432, 862–865.

71

(71) The Nobel Prize in Chemistry 2013. The Nobel Prize in Chemistry 2013 http://www.nobelprize.org/nobel_prizes/chemistry/laureates/2013/ (accessed Apr 4, 2015).

(72) Charifson, P. S.; Corkery, J. J.; Murcko, M. A.; Walters, W. P. Consensus Scoring: A Method for Obtaining Improved Hit Rates from Docking Databases of Three-Dimensional Structures into Proteins. J. Med. Chem. 1999, 42, 5100–5109.

(73) Ece, A.; Sevin, F. The discovery of potential cyclin A/CDK2 inhibitors: a combination of 3D QSAR pharmacophore modeling, virtual screening, and molecular docking studies. Med. Chem. Res. 2013, 22, 5832–5843.

(74) Houston, D. R.; Walkinshaw, M. D. Consensus docking: improving the reliability of docking in a virtual screening context. J. Chem. Inf. Model. 2013, 53, 384–390.

(75) Ji, X.; Zheng, Y.; Wang, W.; Sheng, J.; Hao, J.; Sun, M. Virtual screening of novel reversible inhibitors for marine alkaline protease MP. J. Mol. Graph. Model. 2013, 46, 125–131.

(76) Yuri Pevzner, Daniel N. Santiago, Jacqueline L. von Salm, Rainer S. Metcalf, Kenyon G. Daniel, Laurent Calcul, H. Lee Woodcock, Bill J. Baker, Wayne C. Guida, W. H. B. Virtual target screening to rapidly identify potential protein targets of natural products in drug discovery. AIMS Mol. Sci. 2014, 1, 49–66.

(77) Chen, Z.; Wang, X.; Zhu, W.; Cao, X.; Tong, L.; Li, H.; Xie, H.; Xu, Y.; Tan, S.; Kuang, D.; Ding, J.; Qian, X. Acenaphtho[1,2- b ]pyrrole-Based Selective Fibroblast Growth Factor Receptors 1 (FGFR1) Inhibitors: Design, Synthesis, and Biological Activity. J. Med. Chem. 2011, 54, 3732–3745.

(78) Liu, X.; Ouyang, S.; Yu, B.; Liu, Y.; Huang, K.; Gong, J.; Zheng, S.; Li, Z.; Li, H.; Jiang, H. PharmMapper server: a web server for potential drug target identification using pharmacophore mapping approach. Nucleic Acids Res. 2010, 38, W609–W614.

(79) Borodina, Y. V.; Filimonov, D. A.; Poroikov, V. V. Computer-aided prediction of prodrug activity using the pass system. Pharm. Chem. J. 1996, 30, 760–763.

(80) Filimonov, D. A.; Poroĭkov, V. V; Karaicheva, E. I.; Kazarian, R. K.; Budunova, A. P.; Mikhaĭlovskiĭ, E. M.; Rudnitskikh, A. V; Goncharenko, L. V; Burov, I. V. [The computerized prediction of the spectrum of biological activity of chemical compounds by their structural formula: the PASS system. Prediction of Activity Spectra for Substance]. Eksp. Klin. Farmakol. 58, 56–62.

(81) Lagunin, A.; Filimonov, D.; Poroikov, V. Multi-targeted natural products evaluation based on biological activity prediction with PASS. Curr. Pharm. Des. 2010, 16, 1703–1717.

(82) Konc, J.; Janezic, D. ProBiS: a web server for detection of structurally similar protein binding sites. Nucleic Acids Res. 2010, 38, W436–W440. 72

(83) Konc, J.; Janezic, D. ProBiS algorithm for detection of structurally similar protein binding sites by local structural alignment. Bioinformatics 2010, 26, 1160–1168.

(84) Konc, J.; Cesnik, T.; Konc, J. T.; Penca, M.; Janežič, D. ProBiS-database: precalculated binding site similarities and local pairwise alignments of PDB structures. J. Chem. Inf. Model. 2012, 52, 604–612.

(85) Konc, J.; Janezic, D. ProBiS-2012: web server and web services for detection of structurally similar binding sites in proteins. Nucleic Acids Res. 2012, 40, W214–21.

(86) Berman, H. M. The Protein Data Bank. Nucleic Acids Res. 2000, 28, 235–242.

(87) Bernstein, F. C.; Koetzle, T. F.; Williams, G. J. B.; Meyer, E. F.; Brice, M. D.; Rodgers, J. R.; Kennard, O.; Shimanouchi, T.; Tasumi, M. The protein data bank: A computer-based archival file for macromolecular structures. J. Mol. Biol. 1977, 112, 535–542.

(88) Filimonov, D.; Poroikov, V.; Borodina, Y.; Gloriozova, T. Chemical Similarity Assessment through Multilevel Neighborhoods of Atoms: Definition and Comparison with the Other Descriptors. J. Chem. Inf. Model. 1999, 39, 666–670.

(89) Henke, B. R.; Consler, T. G.; Go, N.; Hale, R. L.; Hohman, D. R.; Jones, S. A.; Lu, A. T.; Moore, L. B.; Moore, J. T.; Orband-Miller, L. A.; Robinett, R. G.; Shearin, J.; Spearing, P. K.; Stewart, E. L.; Turnbull, P. S.; Weaver, S. L.; Williams, S. P.; Wisely, G. B.; Lambert, M. H. A new series of estrogen receptor modulators that display selectivity for estrogen receptor beta. J. Med. Chem. 2002, 45, 5492–5505.

73

CHAPTER FOUR:

CHARMMING DRUG DESIGN DATABASE AND ADDITIONAL MODULES

DEVELOPMENT

CHARMMing Drug Design Database

Database design considerations

Successful design of any data driven application relies heavily on its data storage and management backbone. Poor database design choices made at the inception of the project can significantly hinder the development at any point. Moreover, particularly damaging setbacks may result if poor design choices become apparent at later stages of the project life cycle. It seems difficult however to foresee all the use cases of the application, all the functionality and changes in specifications along the way. The dynamic nature of the application is especially characteristic of open source projects where the development contributions often take the project in the directions not initially intended or designed for.

To design a database that can efficiently and intuitively store data for a wide range of applications pertaining to a particular field one does not need to foresee all the use cases of the application. The database should not aim to achieve a particular usage profile. What database should do is to model the particular phenomenon or its targeted “universe” or a field which it is to represent. In the case of CHARMMing drug design database this “universe” is a field of drug design and discovery. Although in the case of CHARMMing, the field is narrowed specifically to

74

computer-aided drug discovery, the underlying data structures should ideally reflect the fact that this field is a part of larger field.

How does modeling the “universe” as a whole help design a better database for an application with particular usage scenarios? Any scenario that can be incorporated in an application and any use case will undoubtedly involve some aspects of the field that the underlying database is designed to model. It can also be said that a particular use case is not likely to involve all possible aspects of the field. At the same time however, a use case will never involve any aspects of fields completely outside of the one modeled by a database. As an example of the former case, a rigid protein docking is not going to care about existence of multiple protein conformations. This doesn’t mean that there shouldn’t be a data structure that is designed to hold such information because flexible-protein docking, which is certainly within the realm of computer-aided drug discovery, will use this knowledge, thus the need for the representative data structures is necessitated. On the other end of the spectrum, the database designer can safely assume that there is no need to incorporate data structures storing the information regarding accounts receivables since accounting is rather different “universe” from computer-aided drug discovery and does not need to be represented in the database.

CHARRMing’s initial functionality already included modeling of proteins and the data structures that represent the target portion of the computer-aided drug discovery “universe” were already in place. To make the model of the entire “universe” more complete, the efforts to develop CHARMMing docking module included supplementing the existing database with data structures representing small molecule world and its relationship and interactions with the target world. Some additional data structures pertaining to targets were also added to more completely

75

represent the aspects of computer-aided drug discovery. The data structures to explicitly store information regarding the following entities were included:

 Target data

o Proteins, protein conformations, binding sites, residues, amino acids

 Ligand/Substrate data

o Ligands, fragments, de novo ligands, poses

 Analysis

o Molecular interactions, scoring, object attributes, metrics

 Infrastructure

o File and job management

Additionally, although explicit data structures weren’t created, a link to the outside databases was established to fetch experimental results for the purpose of creating predictive models for

(Quantitative) Structure-Activity Relationship ((Q)SAR) studies.

Dynamic Relationship Schema

When designing a database for a project that is projected to have frequent addition of new modules and functionality over its lifetime it becomes a challenge to keep the complexity of the database in check while at the same time accommodating for the “unknown” future functionality.

Specifically, number of tables storing new types of objects defined by new modules will grow linearly with the number of new objects. To keep flexibility in assigning attributes to object (i.e. to be able to assign any number of attributes to any number of object) many-to-many relationships will need to be defined between all the object types and their corresponding attributes. To accomplish this conventional database design approach dictates the need for a triad

76

of tables for each type of object to store the data pertaining to the object along with an indefinite number of attributes (Figure 13).

Figure 13. The triad of tables dictated by traditional database design approach to define an object. In this case DockedPose, its attributes and a many-to-many relationship between them, described in the “Docked Pose Attribute Lookup” table.

The total number of tables required to store object-attribute data using this conventional approach is 3*N where N is a number of different types of objects. To curb the growth of the number of tables while, at the same time, allow for flexibility necessary for a project with a yet undefined future functionality an unconventional design approach was taken. In this approach here termed dynamic relationship all of the object types still reside in their respective tables.

However all the attributes, independent of the possible object affiliations are defined in one table.

In addition, and this is the key concept of the approach, all of the object-attribute associations are stored in a single table. To specify which object a particular attribute relates a separate field is used where the name of the table of the particular object type is stored (Figure 14).

77

Figure 14. The layout of the dynamic relationship schema. The objects (enclosed in the black frame) each reside in their respective tables as in the conventional design approach. The attributes all reside in a single table irrespective of their applicability to a particular type of object. ObjectAttributes table specifies the relationship between a specific attribute and a specific object by means of the ObjectID as dynamic foreign key to a table specified in the ObjectTableName field.

The dynamic relationship schema design reduces the number of tables from 3*N to N+2, a three-fold reduction in the number of tables. This design allows linking any object with any attribute even if the link may seem less than intuitive during design time but becomes apparent during development of a particular module. Additionally, because many objects in

CHARMMing have association with data residing outside of the database on a local file system as flat text files same dynamic approach was used to link objects with paths to files on disk. The dynamic relationship approach facilitates any type of future development within and outside of the scope of molecular modeling or computer-aided drug discovery as the design is non-specific yet highly flexible.

. One potential disadvantage of dynamic relationship approach is that the queries to retrieve the data become more complicated. However, in cases where the database access is

78

handled by the ORM (object-relational mapping) as is the case with the Django engine which makes up the backbone of the CHARMMing web server, this is not an issue since the type of objects (i.e. the table name criteria) can be easily specified as an additional tern in the filter clause of the ORM API (application program interface) when interrogating the database.

CHARMMing Fragment-Based de novo Drug Design Module

Introduction

Another in silico approach that is part of the computer-aided drug design methodology is de novo ligand design. In this approach, unlike docking or virtual screening where a set of ligands are docked into the protein binding site, the ligands are constructed de novo. In other words, novel ligands are constructed as a result of this approach. There are a number of various approaches to this method. Atom-based design assembles or grows ligands atom by atom.1

Fragment-based approach starts with a set of low molecular weight fragments and assembles novel ligands using them as building blocks. 2–6 There are advantages and disadvantages to these two approaches. Atom-based approach can sample a larger chemical space and result in a proposal of more diverse set of novel ligands. The disadvantage is however that the proposed ligands are more likely to be difficult to synthesize or even represent highly unrealistic structures.3 Fragment-based approaches, to a degree, address the problem of unrealistic structures by using fragments commonly found in known molecules and result in more drug-like structures.

This however keeps fragment-based approaches largely within the confines of relatively well explored chemical space and sacrifices diversity and novelty for the synthetic feasibility. Still however, it remains a significant challenge to ensure that compounds proposed by de novo

79

approaches are readily synthesizable and much effort has gone into incorporating synthetic feasibility analysis into such approaches.3–6

Fragment-based approaches, in turn utilize variety of methods to both generate the de novo structures and evaluate them in regards to their potential as a substrate for a particular target. Two major components of the approach to the structure generation are ligand-based and structure-based. In ligand-based design, construction of the de novo ligand is guided by reference to other known ligand(s), pharmacophores or scaffolds.7–12 In the case of structure-based approach, the de novo design is guided primarily by the character of the binding site and evaluated solely based on the fragments’ and ligand’s interactions with the target.13–16 Logically, there exist a number of methods where the above two approaches are combined in one way or another. 2,17–19 There are certain trade-offs between choosing a particular approach to fragment- based de novo design. Ligand-based methods somewhat alleviate the problem of synthetic accessibility of proposed ligands if novel ligands are forced by the evaluation conditions to be highly similar to known easily synthesizable structures. On the other hand, such an approach results the diversity and novelty of the proposed structures in comparison to a purely target structure-based approach.

Despite the lingering issue of synthetic feasibility of ligands generated by the fragment- based approaches fragment-based de novo design methods have advantages over conventional virtual screening. Molecular fragments, although highly promiscuous when docked individually result in high ligand efficiency (LE). That is high predicted affinity (docking score) per heavy atom. If a novel compound is constructed from highly efficient fragments, the overall efficiency stays at a high level in a fully assembled ligand without the promiscuity of its individual building

80

blocks. The LE of novel ligands that result from such an approach is likely to be higher than that of an average ligand in a virtual screening library.

Fragment-Based de novo Design Protocol Implementation in CHARMMing

Similarly to the docking protocol described in the chapter two, CHARMMing’s fragment-based de novo ligand design protocol is based on programs developed in the group of

Prof. Caflisch at University of Zurich. 2,20,21 To serve as starting material, a library of 3,252 fragments is pre-loaded into CHARMMing and is accessible to all users. This library is based on the Maybridge Hitfinder™ set (www.maybridge.com) on which public ligand library pre-loaded into CHARMMing is based as well. To generate the fragment set program DAIM

(Decomposition And Identification of Molecules)21 was used with default settings. The fragments were then atom-typed and parametrized for compatibility with CHARMM

Generalized Force Field (CGenFF).22–25

De novo design protocol, outlined in Figure 15 consists of the following steps:

1. Each fragment selected by user is docked into the user-specified binding site using

the program SEED (Solvation Energy for Exhaustive Docking).20 More detailed

mechanism of this fragment docking step is described in chapter two as this step is

identical to the corresponding step of the CHARMMing’s fragment docking

protocol.

2. Program GANDI2 with a CGenFF compatible parameter file created and

customized for this project is then used to join the docked fragments. The

fragments are connected to each other in an iterative fashion using a set of pre-

loaded linker fragments and a procedure that grows the “connected” ligand by

81

randomly linking “disconnected” fragments and evaluating the fitness of the

resulting structure using the island model genetic algorithm. All de novo ligands

consisting of 2 and 3 fragments are evaluated in this manner. The fitness metric is

a GANDI scoring function that incorporates protein-ligand van der Waals and

electrostatic interactions as well, intra ligand interactions between fragments and

linkers and penalties for unfavorable dihedral angles between newly connected

fragments.2

3. Top 20 ligands that have passed the energy cutoff are then atom-typed and

parametrized to CGenFF standards using the program MATCH (Multipurpose

atom-typer for CHARMM)26 for further processing by CHARMM program.

Number of ligands to be kept for further processing is a user-modifiable parameter.

4. Each ligand pre-processed in previous step is then incorporated into a ligand-target

CHARMM compatible system and minimized using the adopted basis Newton-

Raphson (ABNR) for thousand steps while keeping the protein atoms fixed. The

resulting CHARMM interaction energy values are recorded and the final pose of

the ligand within the binding site of the target is also scored with SEED in its

“evaluation only” mode.

5. The consensus ranking approach used in the docking protocol and described in

chapter two is utilized to assign final rankings to the newly constructed ligands.

The consensus ranking is based on three scores GANDI score, CHARMM total

interaction energy, and SEED total energy.

82

Figure 15. Flow chart of the CHARMMing’s fragment-based de novo design procedure. Procedure starts with a pre-loaded library of fragments which are first docked using SEED program. Then GANDI joins docked fragments to form new ligands. Top ligands are then atom- typed and then parametrized using MATCH. Final series of steps involve minimization with CHARMM, evaluation with SEED and consensus ranking using GANDI score, CHARMM and SEED energies.

The specification of binding site is also identical to the docking protocol and allows user to use a co-crystallized ligand, if present, to define a binding site, or define a custom binding site using a graphical point-and-click interface that is part of the CHARMMing’s drug design module.

The fragment-based de novo approach currently implemented in CHARMMing is purely target structure-based and the evaluation of each new ligand’s fitness is only dependent on the energetics of its interaction with the target and itself. GANDI however is a program that combines both structure-based and ligand-based construction and evaluation of the generated

83

ligands. The ligand-based criteria can be incorporated in a number of ways. A user can specify a template ligand which can serve as either a 2D or 3D reference during the de novo design procedure. In this case the similarity metric between each newly constructed ligand and the template is added to the GANDI scoring function with a user-defined weight coefficient. This allows the user to put high emphasis on the similarity aspect of the procedure. Similarly a pharmacophore definition can be supplied in place of the template ligand, analogously affecting the GANDI scoring function. The incorporation of the template and pharmacophore specification into the CHARMMing’s fragment-based de novo design interface is currently under development.

The interface of the CHARMMing’s fragment-based de novo design protocol is almost identical to that of the docking protocol (Figures 16 and 17) and incorporates a list of pre-loaded fragments on the job submission page as well as reflects the nature of the ligand construction protocol in the newly designed ligand names and scores.

CHARMMing QSAR/SAR Module

Introduction

A major component of computer-aided drug discovery is using known information to make informed decisions regarding the direction of a research campaign. Specifically, this involved building predictive models that infer a compound’s potential activity based on its features (mainly structural) and data available from other experiments. The models constructed in the context of this approach fall into two categories, structure-activity relationship (SAR)28 and quantitative structure-activity relationship (QSAR)29. This type of predictive modeling has initially found most application during the lead optimization stage of the discovery campaign

84

Figure 16. CHARMMing’s fragment-based de novo design job submission page. This page differs from docking submission page in the list of starting molecules. Here pre-loaded fragment set consisting of 3252 fragments is presented as a starting material.

where a number of both pharmacodynamics and pharmacokinetic properties can be optimized by iteratively applying (Q)SAR models to lead compounds and their derivatives.30–32 Specifically potency, selectivity and specificity towards a particular target as well as absorption, distribution, secretion, metabolism and excretion collectively called ADME are among properties often sought to be improved using predictive modeling. Among the advantages of using (Q)SAR at early stages of the discovery campaign is the methodology’s ability to help identify undesirable compounds that are potentially highly promiscuous, toxic, or otherwise show lack of potential to 85

Figure 17. CHARMMing’s fragment-based de novo design job details/results page. This page differs from docking job details/results page in the naming of the newly constructed ligands and the scores involved in the consensus ranking. Here the de novo ligands reflect the naming of the genetic algorithm and GANDI score replaces docking’s FFLD27 score.

be developed into a drug. This saves money by helping avoid high expenditures by discarding potentially problematic compounds at earlier stages of the drug discovery pipeline before committing substantial resources to the development and evaluation of drug candidates.33–35

Predictive modeling have found wide use in industrial design, regulatory assessment of pharmaceutical agents, pesticides and other substances.36,37 Additionally, (Q)SAR models have found increasing use as an alternative to animal testing in cases where there is a push to actively steer away from methods that utilize animals for compound testing.38,39 Recently (Q)SAR modeling has been finding applications in conjunction with other computational drug discovery approaches such as virtual screening in addition to applications in substance risk assessment.40 86

However, the predictive performance and quality of (Q)SAR models in many respects lags behind other areas of machine learning applications and warrant improvements.41

As part of the efforts to address some of the shortcoming of (Q)SAR several SAR models have been developed based on such machine learning qualifiers as random forest42,43 and k

Nearest Neighbor Simulated Annealing.44 These efforts helped lead to the identification of novel non-nucleoside chemical motifs as well as Candesartan, a drug used in treatment of hypertension and heart failure, for NS5B, hepatitis C virus RNA polymerase activity.45 This section describes implementation of the above and other models in the context of the development of the new

(Q)SAR module of the CHARMMing web server.46

(Q)SAR Functionality of CHARMMing

(Q)SAR workflow in CHARMMing consists of two main components - model training and prediction. It is necessary to create a predictive model or use existing model before predictive analysis can be performed.

To train a model user can upload an SDF (structure data file) file that contains compounds’ structural information as well as other descriptors and metrics. To perform SAR prediction the training set needs to contain binary (Yes/No, Active/Inactive, etc.) activity metric.

For QSAR models numerical activity values (e.g., IC50) need to be present in the training set.

CHARMMing also provides interface to access PubChem Bioassay database that accepts a search criteria from the user and relays it to PubChem servers via PUG REST (Power User

Gateway Representational State Transfer) and SOAP (Simple Object Access Protocol) interfaces.47–52 The search returns a list of assays with link to the detailed information on

PubChem website and an option to download an SDF representing the structures with

87

corresponding structure/assay metrics. This file can then be directly submitted for model training into CHARMMing (Figure 18). List of models available through CHARMMing interface for training is listed in Table 7. Once the SDF is submitted, the user is asked to choose the property from the SDF (binary for SAR, numerical for QSAR) to use as a metric for model building. Once this selection is made the model building commences. Model training job is processed using

CHARMMing’s job infrastructure utilized by all the other modules.

Once training is complete the model is saved in the database and is only accessible to the user that has created it. User can view all the models created in the past along with the pertinent model information that among other things helps assess the quality of the model and its ability to produce reliable predictions. The screenshot of the model information screen in Figure 19 is an

Figure 18. CHARMMing (Q)SAR model training file upload interface. Categorization models are used for SAR, Regression for QSAR. Currently available models with brief description are listed in Table 7.

example of a Random Forest Categorization model and includes among other metrics the following measures of model quality – cross validation53, y-randomization54, area under the curve (AUC)55, precision and. Pearson correlation coefficient R2 is presented for regression

88

Table 7. List of predictive models available through CHARMMing.

Figure 19. CHARMMing (Q)SAR model details screen. The screen presents user with the basic information about the job as well as common metrics to assess the predictive power and reliability of the model. Additionally, here user can run predictions based on this job.

models. The model details screen is where a user can initiate prediction based on the model by uploading an SDF file with the test set structures.

89

The progress of any training or prediction job can be monitored in real time in a similar manner as other modules implemented in charming. Additionally SAR and QSAR tutorials are provided to introduce the user to the concept of predictive modeling as well as provide a guided walkthrough using the dataset used in the NS5B study.45

Acknowledgements

The authors acknowledge Dr. Paul A. Thiessen, Dr. Evan Bolton, and his team for valuable discussions and for providing core expertise in the area of PubChem. This research was supported in part by the Intramural Research program of the National Heart, Lung, and Blood

Institute, NIH. IEW, BTM, and BRB utilized the high performance computational capabilities of the LoBoS (http://www.lobos.nih.gov/) and Biowulf Linux clusters (http://biowulf.nih.gov) at the National Institutes of Health.

Computational Chemistry Education with CHARMMing

Introduction

One of the primary purposes of CHARMMing is to serve as a bridge between the powerful capabilities of molecular modeling tools and a user that may not have capacity or technical expertise to operate these tools in their “raw” form. Aside from targeting academic labs and scientists that use CHARMMing to facilitate their research, the web interface provides a perfect means to serve another important need – chemistry education, and not necessarily just computational chemistry education. Basic chemistry concepts can be reinforced or approached from a different, possibly more engaging angle via the use of molecular dynamics.56,57 Most, especially non-commercial, molecular modeling software is not easy to use and requires

90

knowlede of program specific scripting syntax in additiona to some technical expertise. It has been argued however, that the problem of software complexity can be overcome by “friendly web interfaces” or via the use of bootable media with pre-installed software.58 CHARMMing, of course takes the former approach. Without the need to learn scripting languages or even look at somewhat intimidating program scripts, with a familiar point-and-click style of visual interface and engaging graphics CHARMMing has a potential to capture students that would be turned off by the complexity of molecular modeling. There are several factors that make CHARMMing espeicially attractive as an engaging educational tool:

 Ability to perform many common molecular modeling and drug design tasks

 Ability to monitor progress of jobs in real time

 Ability to perform tasks in different order

 Ability to visualize and graphically manipulate systems

 Availability of tutorials for most tasks incorporated into CHARMMing

To further increase usefullness of CHARMMing as an educational tool an interactive lessons module have been added to the web server and described in a series of papers.59–61 Here an overview of the lessons module is given along with exapmles of lessons incorporated into

CHARMMing and their application as educational tools.

Design and Implementation of CHARMMing Lessons Module

In order to increase CHARMMing’s lessons module’s value as an educational tool, several design considerations and functionalities were taken into account:

 Provide a range of interactive lessons from the basic introductory to more

advanced.

91

 Provide the students with the ability to track their progress in real-time. The

student should be able to know where and at which step they are in a particular lesson, at

as well as how does a particular lesson fit into a greater view of molecular modeling.

 The module should identify mistakes made by the student during the course of the

lesson, provide meaningful feedback and allow correction of such mistakes in order to

proceed with the lesson plan.

 Provide the user with ability to perform different taks to accomplish similar goals

along with explanation of the differences and the applicability domain of each approach.

 Provide the means for educators to create new lessons and workflows for

students.

CHARMMing incorporates primarily a structure centric workflow. That is, before any tasks can be performed a protein structure needs to be provided. This can be done via a query to

PDB or a user specification. At the point of structure submission, a user is given the option to make the structure part of a particular lesson. Once the structure is associated with a lesson, any task that is performed with the structure is compared to the lesson plan for that particular lesson and the feedback to the user is provided. If, for example, user performs minimization on a structure where, according to lesson plan solvation was warranted first, the lesson progress indicates that solvation was not performed and the lesson cannot proceed until it is. Similarly if the task that is performed by the user is the correct task but with incorrect parameters (e.g. number of minimization steps is different from that in the lesson plan) the progress report notes that the minimization needs to be re-run because the parameter value was not correct. The overview of the overall lesson progress is updated in real time and is part of the navigation pane present on every page of the web site. The lesson progress also includes links to the details of

92

each step (Figure 20). Currently a total of seven lessons have been incorporated into

CHARMMing:

1. Lesson one: Introduction. Introduction to the basic features of CHARMMing as

well as basic principles of structure preparation and molecular dynamics

simulation.

2. Lesson two: Simulation proteins. A lesson designed for a more thorough

investigation of molecular dynamics of proteins.

3. Lesson three: Enhanced sampling/Self-guided Langevin dyamics (SGLD).

Intorduces students to a different flavor of protein dynamics to study

conformational changes.

4. Lesson four: Custom residue topology files (RTF), quantum mechanics/molecular

mechanics (QM/MM). An introduction to the hybrid QM/MM simulations and a

closer look at the files representing structure topology in CHARMM .

5. Lesson five: Coarse grained modeling. Introduction to the representation of

protein at a lower resolution by using “beads” representing groups of atoms and

using this approach to investigate larger scale conformational changes in

proteins.60

6. Lesson six: Oxidation/Reduction Calculations. Allows students to investigate the

process of electron donation and utilize the graphical interface to perform amino

acid mutations.61

7. Lesson seven: Molecular docking. An introduction to the concept of docking as a

tool of the computer-aided drug design on an example of a self-dock.

93

In addition to the above lessons CHARMMing provides functionality to create new lessons by “recording” steps performed by a user into a lesson plan. If, for example, user checks

“Start Lesson Recording” when initially uploading the structure, performs solvation, minimization and runs dynamics, all of these steps with the corersponding parameters for each task will be saved as a lesson plan and presented as a custom lesson.

In addition to the lessons module, a series of CHARMM/CHARMMing tutorials are available at http://www.charmmmtutorial.com. The tutorials cover all the concepts on that serve

Figure 20. Examples of lesson progress bar. Three snapshots of a lesson progress of an “Introduction to Simulation” lesson. Student’s progress in the entire lesson is shown. Clicking on “Failed” link takes student to the details of that particular step with explanation of failure and steps necessary to resume the lesson.

as a scientific base for CHARMMing and include walkthroughs of common workflows such as molecular dynamics simulations, docking studies, predictive modeling and others. Together with

CHARMMing lessons these resources play a role beyond education. Most importantly they help ensure proper usage of CHARMMing and the tools it provides, highlight the importance of critically evaluating the results and promote proper scientific practices.

94

Conclusion

The wide range of functionality already available through CHARMMing and currently under development would not be possible without contribution from experts from a variety of fields within and outside of molecular modeling and computer-aided drug discovery. Moreover important design considerations at each step of development made it possible to unify and bring under one umbrella the diversity of the fields of molecular modeling and computer-aided drug discovery. The results of these efforts serve an important need and bring the power and variety of computational tools to those that have least access to them yet could gain the most from their use. With this however, the contributors to CHARMMing as well as to other open source scientific tools carry a burden of responsibility. Although one of the primary goals is that the fruits of their labor help make science more accessible, it is even more important that open- source scientific resources ensure that the science facilitated by these tools is sound. While making complex concepts more palatable to the user, it is imperative to make user realize the limitations of the approaches involved and the fact that results should often be taken with a grain of salt. CHARMMing contributors attempt to accomplish this mission by implementing only published and well-reviewed, well tested methods, by providing tutorials and lessons, by incorporating penalties into custom ligands whose structures are too “different” from existing force field parameters, by including confidence and performance metrics for (Q)SAR predictive models, by providing means to visually inspect the structures and monitor jobs output. The steps to ensure proper use of tools often require substantial effort. However it is the responsibility of the contributors to ensure that the methods implemented as part of these projects have solid scientific foundation. Moreover it is important that the interfaces and the workflows, however

95

friendly they may be, encourage critical evaluation of the results and promote the soundness of the scientific research.

List of References

(1) Luo, Z.; Wang, R.; Lai, L. RASSE: a new method for structure-based drug design. J. Chem. Inf. Comput. Sci. 36, 1187–1194.

(2) Dey, F.; Caflisch, A. Fragment-based de novo ligand design by multiobjective evolutionary optimization. J. Chem. Inf. Model. 2008, 48, 679–690.

(3) Hartenfeller, M.; Schneider, G. De novo drug design. Methods Mol. Biol. 2011, 672, 299– 323.

(4) Konteatis, Z. D. Design strategies for computational fragment-based drug design. Methods Mol. Biol. 2015, 1289, 137–144.

(5) Mortier, J.; Rakers, C.; Frederick, R.; Wolber, G. Computational tools for in silico fragment-based drug design. Curr. Top. Med. Chem. 2012, 12, 1935–1943.

(6) Sheng, C.; Zhang, W. Fragment informatics and computational fragment-based drug design: an overview and update. Med. Res. Rev. 2013, 33, 554–598.

(7) Fechner, U.; Schneider, G. Flux (2): comparison of molecular mutation and crossover operators for ligand-based de novo design. J. Chem. Inf. Model. 47, 656–667.

(8) Fechner, U.; Schneider, G. Flux (1): a virtual synthesis scheme for fragment-based de novo design. J. Chem. Inf. Model. 46, 699–707.

(9) Feher, M.; Gao, Y.; Baber, J. C.; Shirley, W. A.; Saunders, J. The use of ligand-based de novo design for scaffold hopping and sidechain optimization: two case studies. Bioorg. Med. Chem. 2008, 16, 422–427.

(10) Kutchukian, P. S.; Lou, D.; Shakhnovich, E. I. FOG: Fragment Optimized Growth algorithm for the de novo generation of molecules occupying druglike chemical space. J. Chem. Inf. Model. 2009, 49, 1630–1642.

(11) Proschak, E.; Sander, K.; Zettl, H.; Tanrikulu, Y.; Rau, O.; Schneider, P.; Schubert- Zsilavecz, M.; Stark, H.; Schneider, G. From molecular shape to potent bioactive agents II: fragment-based de novo design. ChemMedChem 2009, 4, 45–48.

96

(12) Proschak, E.; Zettl, H.; Tanrikulu, Y.; Weisel, M.; Kriegl, J. M.; Rau, O.; Schubert- Zsilavecz, M.; Schneider, G. From molecular shape to potent bioactive agents I: bioisosteric replacement of molecular fragments. ChemMedChem 2009, 4, 41–44.

(13) Degen, J.; Rarey, M. FlexNovo: structure-based searching in large fragment spaces. ChemMedChem 2006, 1, 854–868.

(14) Douguet, D.; Munier-Lehmann, H.; Labesse, G.; Pochet, S. LEA3D: a computer-aided ligand design for structure-based drug design. J. Med. Chem. 2005, 48, 2457–2468.

(15) Durrant, J. D.; Amaro, R. E.; McCammon, J. A. AutoGrow: a novel algorithm for protein inhibitor design. Chem. Biol. Drug Des. 2009, 73, 168–178.

(16) Moriaud, F.; Doppelt-Azeroual, O.; Martin, L.; Oguievetskaia, K.; Koch, K.; Vorotyntsev, A.; Adcock, S. A.; Delfaud, F. Computational fragment-based approach at PDB scale by protein local similarity. J. Chem. Inf. Model. 2009, 49, 280–294.

(17) Hecht, D.; Fogel, G. B. A novel in silico approach to drug discovery via computational intelligence. J. Chem. Inf. Model. 2009, 49, 1105–1121.

(18) Nicolaou, C. A.; Apostolakis, J.; Pattichis, C. S. De novo drug design using multiobjective evolutionary graphs. J. Chem. Inf. Model. 2009, 49, 295–307.

(19) Nisius, B.; Rester, U. Fragment shuffling: an automated workflow for three-dimensional fragment-based ligand design. J. Chem. Inf. Model. 2009, 49, 1211–1222.

(20) Majeux, N.; Scarsi, M.; Caflisch, A. Efficient electrostatic solvation model for protein- fragment docking. Proteins 2001, 42, 256–268.

(21) Kolb, P.; Caflisch, A. Automatic and efficient decomposition of two-dimensional structures of small molecules for fragment-based high-throughput docking. J. Med. Chem. 2006, 49, 7384–7392.

(22) Vanommeslaeghe, K.; MacKerell, A. D. Automation of the CHARMM General Force Field (CGenFF) I: bond perception and atom typing. J. Chem. Inf. Model. 2012, 52, 3144– 3154.

(23) Vanommeslaeghe, K.; Raman, E. P.; MacKerell, A. D. Automation of the CHARMM General Force Field (CGenFF) II: assignment of bonded parameters and partial atomic charges. J. Chem. Inf. Model. 2012, 52, 3155–3168.

(24) Vanommeslaeghe, K.; Hatcher, E.; Acharya, C.; Kundu, S.; Zhong, S.; Shim, J.; Darian, E.; Guvench, O.; Lopes, P.; Vorobyov, I.; Mackerell, A. D. CHARMM general force field: A force field for drug-like molecules compatible with the CHARMM all-atom additive biological force fields. J. Comput. Chem. 2010, 31, 671–690.

97

(25) Vanommeslaeghe, K.; Hatcher, E.; Acharya, C.; Kundu, S.; Zhong, S.; Shim, J.; Darian, E.; Guvench, O.; Lopes, P.; Vorobyov, I.; Mackerell, A. D. CHARMM general force field: A force field for drug-like molecules compatible with the CHARMM all-atom additive biological force fields. J. Comput. Chem. 2010, 31, 671–690.

(26) Yesselman, J. D.; Price, D. J.; Knight, J. L.; Brooks, C. L. MATCH: an atom-typing toolset for molecular mechanics force fields. J. Comput. Chem. 2012, 33, 189–202.

(27) Budin, N.; Majeux, N.; Caflisch, A. Fragment-Based flexible ligand docking by evolutionary optimization. Biol. Chem. 2001, 382, 1365–1372.

(28) Wikipedia, Structure–activity relationship. Wikipedia, Structure–activity relationship http://en.wikipedia.org/wiki/.

(29) Wikipedia, Quantitative structure–activity relationship. Wikipedia, Quantitative structure– activity relationship.

(30) Kalyani, G.; Sharma, D.; Vaishnav, Y.; Deshmukh, V. S. A REVIEW ON DRUG DESIGNING, METHODS, ITS APPLICATIONS AND PROSPECTS. IJPRD 2013, 5, 015–030.

(31) Bleicher, K. H.; Böhm, H.-J.; Müller, K.; Alanine, A. I. Hit and lead generation: beyond high-throughput screening. Nat. Rev. Drug Discov. 2003, 2, 369–378.

(32) Csermely, P.; Korcsmáros, T.; Kiss, H. J. M.; London, G.; Nussinov, R. Structure and dynamics of molecular networks: a novel paradigm of drug discovery: a comprehensive review. Pharmacol. Ther. 2013, 138, 333–408.

(33) Bamborough, P.; Drewry, D.; Harper, G.; Smith, G. K.; Schneider, K. Assessment of Chemical Coverage of Kinome Space and Its Implications for Kinase Drug Discovery. J. Med. Chem. 2008, 51, 7898–7914.

(34) Frye, S. V. Structure-activity relationship homology (SARAH): a conceptual framework for drug discovery in the genomic era. Chem. Biol. 1999, 6, R3–R7.

(35) Lovering, F.; Bikker, J.; Humblet, C. Escape from flatland: increasing saturation as an approach to improving clinical success. J. Med. Chem. 2009, 52, 6752–6756.

(36) Manibusan, M.; Paterson, J.; Kent, R.; Chen, J. North American Free Trade Agreement (NAFTA) Technical Working Group on Pesticides (TWG). Guidance Document. North American Free Trade Agreement (NAFTA) Technical Working Group on Pesticides (TWG). Guidance Document, 2012.

(37) McKinney, J. D.; Richard, A.; Waller, C.; Newman, M. C.; Gerberick, F. The practice of structure activity relationships (SAR) in toxicology. Toxicol. Sci. 2000, 56, 8–17.

98

(38) Kapis, M. B.; Gad, S. C. Non-Animal Techniques in Biomedical and Behavioral Research and Testing. Lewis Publishers (CRC Press): Florida, 1993; p 36.

(39) Gade, K. Can we avoid animal testing entirely? ScienceNordic 2014, 6.

(40) Puzyn, T.; Leszczynski, J.; Cronin, M. T. Recent Advances in QSAR Studies - Methods and Applications. In Theoretical and Computational Chemistry; Springer: Dordrecht, 2010; p 8.

(41) Johnson, S. R. The Trouble with QSAR (or How I Learned To Stop Worrying and Embrace Fallacy). J. Chem. Inf. Model. 2008, 48, 25–26.

(42) Wikipedia. Random forest. Random forest http://en.wikipedia.org/wiki/Random_forest.

(43) Breiman, L. Random Forests. Random Forests http://oz.berkeley.edu/~breiman/randomforest2001.pdf.

(44) Yang, C.; Li, Y.; Zhang, C.; Hu, Y. A Fast KNN Algorithm Based on Simulated Annealing. A Fast KNN Algorithm Based on Simulated Annealing http://wwwmath.uni- muenster.de/u/lammers/EDU/ws07/Softcomputing/Literatur/1-DMI5460.pdf.

(45) Weidlich, I. E.; Filippov, I. V.; Brown, J.; Kaushik-Basu, N.; Krishnan, R.; Nicklaus, M. C.; Thorpe, I. F. Inhibitors for the hepatitis C virus RNA polymerase explored by SAR with advanced machine learning methods. Bioorg. Med. Chem. 2013, 21, 3127–3137.

(46) Weidlich, I. E.; Pevzner, Y.; Miller, B. T.; Filippov, I. V; Woodcock, H. L.; Brooks, B. R. Development and implementation of (Q)SAR modeling within the CHARMMing web- user interface. J. Comput. Chem. 2014, 36, 62–67.

(47) PUG REST. PUG REST https://pubchem.ncbi.nlm.nih.gov/pug_rest/PUG_REST.html.

(48) PubChem Bioassay. PubChem Bioassay https://pubchem.ncbi.nlm.nih.gov/assay/assay.cgi.

(49) Wang, Y.; Xiao, J.; Suzek, T. O.; Zhang, J.; Wang, J.; Bryant, S. H. PubChem: a public information system for analyzing bioactivities of small molecules. Nucleic Acids Res. 2009, 37, W623–33.

(50) Wang, Y.; Bolton, E.; Dracheva, S.; Karapetyan, K.; Shoemaker, B. A.; Suzek, T. O.; Wang, J.; Xiao, J.; Zhang, J.; Bryant, S. H. An overview of the PubChem BioAssay resource. Nucleic Acids Res. 2010, 38, D255–66.

(51) Li, Q.; Wang, Y.; Bryant, S. H. A novel method for mining highly imbalanced high- throughput screening data in PubChem. Bioinformatics 2009, 25, 3310–3316.

99

(52) Xie, X.-Q.; Chen, J.-Z. Data mining a small molecule drug screening representative subset from NIH PubChem. J. Chem. Inf. Model. 2008, 48, 465–475.

(53) Wikipedia. Cross-validation (statistics). Cross-validation (statistics) http://en.wikipedia.org/wiki/Cross-validation_%28statistics%29.

(54) Y-randomization. Y-randomization http://www.mathe2.uni- bayreuth.de/markus/pdf/pub/YRandQsar.pdf.

(55) Wikipedia. Receiver operating characteristic. Receiver operating characteristic http://en.wikipedia.org/wiki/Receiver_operating_characteristic.

(56) Martin, N. H. Integration of Computational Chemistry into the Chemistry Curriculum. J. Chem. Educ. 1998, 75, 241.

(57) Burkholder, P. R.; Purser, G. H.; Cole, R. S. Using Molecular Dynamics Simulation To Reinforce Student Understanding of Intermolecular Forces. J. Chem. Educ. 2008, 85, 1071.

(58) Paniagua, J. C.; Mota, F.; Solé, A.; Vilaseca, E. Quantum Chemistry Laboratory at Home. J. Chem. Educ. 2008, 85, 1288.

(59) Miller, B. T.; Singh, R. P.; Schalk, V.; Pevzner, Y.; Sun, J.; Miller, C. S.; Boresch, S.; Ichiye, T.; Brooks, B. R.; Woodcock, H. L. Web-based computational chemistry education with CHARMMing I: Lessons and tutorial. PLoS Comput. Biol. 2014, 10, e1003719.

(60) Pickard, F. C.; Miller, B. T.; Schalk, V.; Lerner, M. G.; Woodcock, H. L.; Brooks, B. R. Web-based computational chemistry education with CHARMMing II: Coarse-grained protein folding. PLoS Comput. Biol. 2014, 10, e1003738.

(61) Perrin, B. S.; Miller, B. T.; Schalk, V.; Woodcock, H. L.; Brooks, B. R.; Ichiye, T. Web- based computational chemistry education with CHARMMing III: Reduction potentials of electron transfer proteins. PLoS Comput. Biol. 2014, 10, e1003739.

100

APPENDIX A

Permission to Print CHARMMing Fragment-Based Docking Paper

Figure A1. Permission to print Fragment-Based Docking: Development of the CHARMMing Web User Interface as a Platform for Computer-Aided Drug Design

101

Permission to Print VTS Paper

Figure A2. Permission to print Virtual Target Screening: Validation Using Protein Kinases

102

Permission to Print CHARMMing (Q)SAR Paper

Figure A3. Permission to print Development and Implementation of (Q)SAR modeling within the CHARMMing web-user interface

103