Rational Design of Novel BCL2A1 Inhibitors for Treatment of Autoimmune Diseases: An Integration of Virtual Screening, Transcriptomics and Protein Biophysics

A Dissertation submitted to the Graduate School of the University of Cincinnati in partial

fulfillment of the requirements for the degree of

Doctor of Philosophy

In the Department of Molecular Genetics, Biochemistry and Microbiology

of the College of Medicine

By

Alexander Thorman

B.S./B.A. Miami University 2012

Committee Chair: Jaroslaw Meller, Ph.D.

Abstract

The balance between cell survival and apoptosis is critical to the modulation of immune responses and misregulation of this balance often mediates diseases such as intrauterine inflammation, rheumatoid arthritis and cancer. The pro-survival proteins of the BCL2-family mediate a pro-survival phenotype through sequestration of the BH3-domain peptides that act as sensitizers to Bax and Bak, which serve as the cell executioners, resulting in mitochondrial outer membrane permeability. This dissertation deals with the design of small molecule inhibitors of BCL2A1 (A1), which has been implicated in a wide array of diseases ranging from autoimmunity resulting in pre-term birth to chemotherapeutic resistance in cancer. To date, no inhibitors specific to A1 have been identified and most that target the BCL2 protein family are unable to effectively block A1 activity.

The strategy employed uses a range of approaches for the rational discovery of A1-BH3 interface inhibitors. Structure-informed or rational design of small molecules that are predicted to interact with A1 at the interface, thus sterically blocking BH3 peptides from binding, presents an opportunity to demonstrate how disrupting critical protein-protein interactions may be used to drive protection from autoimmunity, as well as blocking of a compensatory pro-survival mechanism in cancers treated with pro-apoptotic drugs.

Virtual screening is first employed to identify candidate small molecules predicted to bind A1, followed by in vitro approaches for biophysical characterization of the binding and activity of candidate inhibitors. A multi-stage virtual screening protocol, coupled with experimental validation of top-ranking candidates, reduced an initial library of 90,086 small molecules to 13 that showed significant activity in vitro. These initial lead compounds lay the foundation for developing more specific and sensitive drugs targeting A1.

In addition to virtual screening, an alternative approach to drug discovery that combines chemical similarity and Omics approaches has been developed and applied to identify putative inhibitors of the A1-BH3 interaction. This new approach, dubbed connectivity enhanced Structure-Activity Relation (ceSAR), combines chemical similarity with the analysis of the connectivity between expression signatures of genetic and chemical perturbations generated by the LINCS consortium. Specifically, candidate inhibitors with transcriptional signatures concordant with the signature of knockdown-induced loss of function of the target protein are identified using the iLINCS server. Such candidates are expected to interact with the protein target, or with another pathway member up- or down-stream of the target protein, driving a transcriptional change similar to that of a knockdown of the target of interest.

Chemical similarity is then utilized to cluster candidate compounds, identifying distinct structural classes, potentially targeting distinct proteins within the pathway of interest, and representatives of each cluster are used as a starting point for screening in vitro.

As part of this dissertation, a new bioinformatics tool that enables ceSAR analysis, dubbed Sig2Lead, has been developed in collaboration with other members of the group and made available to the community as an R Shiny application. The results obtained using the Sig2Lead application demonstrate that this method can be combined with virtual screening to provide further enrichment of small molecule libraries.


Author Acknowledgements

Over the years, my family has been supportive of everything to which I have set myself. My parents have encouraged my pursuit of knowledge and driven me to aspire to make the most I can of myself. My brother has always pushed me to study harder and learn as much as I can to keep up with him, allowing us both to grow. My grandmother has always been closely involved in my life and supported me through school, encouraging me to hone my skills even from a young age. Without my family, I could not have made it to where I am today.

I would like to acknowledge my mentors over the past several years for all of their support and guidance in my scientific career. Andy and Jarek have driven me to be an effective multi-disciplinary researcher and I would not be the scientist I am today without their valuable insights and discussions over the past several years. This work has been a hugely collaborative effort and it took the training and efforts of my mentors to really allow me to develop a wide skillset. I hope to remember all of the valuable advice they have provided me with for the rest of my career.

Finally, I would like to dedicate my efforts to my amazing wife. Her support has been constant and needed through all of the ups and downs of both my scientific career and my life. She was there when I started my graduate studies, and has been there for me every day since. Her support has sculpted me into the person I am and her encouragement has helped to get me to this point.

Thank you to everyone that has been a part of the adventure that is my life.


Table of Contents

Rational Design of Novel BCL2A1 Inhibitors for Treatment of Autoimmune Diseases: An Integration of Virtual Screening, Transcriptomics and Protein Biophysics

Abstract

Author Acknowledgements

Table of Contents

Table and Figure List

Chapter I: Approaches to Rational Design of Novel Small Molecule Inhibitors

Introduction

Disease Association

Structure-Informed Design

Virtual Screening for Drug Candidates

Rigid-Body Docking

Flexible Docking

Interpreting Results of Virtual Screens

Genomics Screening

Tanimoto Coefficient for Comparing Structurally Similar Compounds

Fragment Based Screening

Peptide Mimetics

Chapter II: Rational Design of BCL2A1 Inhibitors

Abstract

Introduction

Methods

In silico Docking

Protein Expression/Purification

Thermal Shift Assay

Fluorescence Polarization Assay

LINCS gene expression analysis

Cell Assays

Results

Discussion

Chapter III: Sig2Lead: Integration of Omics signatures and chemical similarity for improved structure activity relationship analysis and lead compound identification

Abstract

Introduction

Methods

Benchmarking of Sig2Lead

Connectivity Alone for Inhibitor Identification

Enhancing Virtual Screening

Results

Sig2Lead as a Tool for Identification of Known Targeted Inhibitors from Pathway-Specific Therapeutics

Increased Sensitivity and Specificity of Transcriptomic Approaches in Drug Discovery through Chemical Similarity

Enhancing Docking Simulations for Identification of Small Molecule Inhibitors

Discussion

Chapter IV: Conclusions and Future Directions

Identified Inhibitors of BCL2A1

HT1080 Cell Line Cell Death Assay

Structural Resolution of BCL2A1 Inhibitors

Development of Identified Inhibitors Through SAR and Compound Growing

Preclinical Trials

Future Developments of Sig2Lead

Web Server and User Interface Updates

Multi-Gene Searching

Automated Docking of Sig2Lead Results

Increased Similarity Searching

Sig2Lead Towards Personalized Medicine

Appendix I: Characterization of CovRS/ArlRS Two-Component Regulatory Systems as a mechanism for virulence regulation in Streptococcus pyogenes and Staphylococcus epidermidis

Introduction

Streptococcus pyogenes

Streptococcus pyogenes pathogenic mechanisms

Regulation of Virulence

Treatments

Staphylococcus epidermidis

Diseases

Biofilm Formation in S. epidermidis

Two-component Systems

CovRS

In vivo selection of CovS mutants

ArlRS

Factors regulated by ArlRS

PAS/PDC Folds

In silico characterization

In vitro characterization

Conclusions

Appendix II: Hiding in Plain Site: Immune Evasion by the Staphylococcal Protein SdrE

Appendix III: Running AutoDock Tools and AutoDock4.2.6

Running AutoDock Locally

Preparing Protein (Receptor)

Preparing Ligand

Finish Preparing the Macromolecule

Defining the Grid Box

Prepare AutoGrid

Running AutoGrid

Preparing AutoDock Parameter File and Running AutoDock

Running AutoDock on the cluster (high throughput)

Appendix IV: Running Sig2Lead

References

Table and Figure List

Figure 1. Heatmap (A) and MDS plot (B) illustrate the same clustering information in different ways.

Figure 2. Upregulation of BCL2A1 is observed across a wide range of cancers.

Figure 3. Overall scheme for structure informed identification of inhibitors of BCL2A1.

Figure 4. Boxes for virtual screening of A1 and Bfl-1 targeting two pockets within the peptide binding groove.

Figure 5. High Dose DSF (A) and FP (B) experiments were performed in parallel to further reduce the overall number of compounds to be tested.

Figure 6. Fluorescence Polarization assays identify compounds that inhibit Noxa binding.

Figure 7. Checkerboard assays show additivity between P2 and P4 inhibitors.

Figure 8. Clustering of tested compounds with compounds inducing gene expression signatures concordant to that of a BCL2A1 knockdown identifies groups of related compounds that are putative pathway inhibitors.

Figure 9. Chemical structures of compounds structurally related to initial hits grouped by compound family.

Figure 10. In vitro assays utilizing activated primary splenocytes reveal cell death in the presence of P2 inhibitor NSC-15508.

Figure 11. Predicted P4 compounds drive some cell death in WT cells at high doses.

Figure 12. Proposed binding of P2 targeted inhibitor, NSC-97318 (blue), and P4-targeted inhibitor, NSC-65847 (orange), to A1 suggest potential for bridging to drive a specific and sensitive interaction that blocks binding of BH3-Domain peptides.

Figure 13. Overview of Sig2Lead methodology.

Figure 14. Sig2Lead enriches for inhibitors of target proteins.

Figure 15. Receiver Operating Curve (ROC) analysis shows an enrichment of true positives in screening of libraries submitted to Sig2Lead.

Figure 16. Sig2Lead enhances sensitivity of transcriptomic analysis.

Figure 17. Library reduction of Sig2Lead (red) and AutoDockVina through MTiOpenScreen (blue) shows an increase in true positives upon library reduction starting from the DUD-E library with structural similarity to something within the LINCS small molecule library.

Figure 18. Sig2Lead can further enrich true positive populations from AutoDockVina by applying a transcriptomics filter to remaining compound libraries.

Figure 19. ROC analysis of Sig2Lead after MTiOpenScreen on the full DUD-E dataset for VGFR2 shows increased reduction of false positives when Sig2Lead is run after docking simulations.

Figure 20. Utilizing a combination of docking simulations and Sig2Lead drives an increased ratio of true positives within a small molecule library.

Figure 21. Compounds tested in vitro with an IC50<400 µM of BCL2A1 inhibition clustered with LINCS-derived compounds.

Figure 22. Percent of compounds from each level of in vitro analysis identified to have similarity to LINCS-derived compounds at Tanimoto similarity of 0.6, 0.7 or 0.8 show trends towards inclusion of all active compounds through structural similarity to compounds within the LINCS database.

Figure 23. Reduction of compounds ordered for screening of BCL2A1 by running Sig2Lead and generating clusters with Tanimoto similarity of 0.65 results in enrichment of active compounds.

Figure 24. Chemical structures of NSC-97318 and NSC-15508 for targeting the P2 pocket of BCL2A1.

Figure 25. CD-HIT shows conserved domains in the intracellular domains of CovS (A) and ArlS (B), but a lack of information about the sensor domains.

Figure 26. Homology models for CovS (A) and ArlS (B) extracellular domains reveal a conserved fold for sensor domains.

Figure 27. OMA Browser shows near evolutionary distance between ArlS (Red) and CovS (Blue).

Figure 28. Homology-based (A-) and cavitation based (D) identification of potential ligand binding sites (red) in the CovS extracellular domain.

Figure 29. Circular dichroism of refolded CovS suggests properly folded protein that is consistent with a mixture of α-helix and β-sheet.

Figure 30. Sedimentation velocity AUC of refolded CovS shows a single species with predicted size of about 14 kDa.

Figure 31. Sedimentation velocity reveals a peak consistent with monomeric CovS-MBP in the presence or absence of LL37 antimicrobial peptide.

Figure 32. AutoDockTools Interface.

Figure 33. AutoDockTools User Preference Tab.

Figure 34. Flow chart representation of running AutoDock in an iterative fashion on a high-throughput screening library.

Figure 35. Installation of Packages in R Studio.

Figure 36. Setting Sig2LeadShiny as the working directory in RStudio.

Figure 37. Sig2Lead landing/search page. On this page, the user inputs target genes of interest and optionally compounds in SMILES format for inclusion in clustering steps of Sig2Lead analysis.

Figure 38. LINCS Compounds Tab for display of all LINCS compounds with concordant expression signatures to user-defined target gene.

Figure 39. Heatmap tab of Sig2Lead for viewing chemical similarity of LINCS small molecules with concordance to a knockdown of a user-defined target gene along with any user defined small molecules (magenta).

Figure 40. MDS Plot tab of Sig2Lead for an alternative view of hierarchically clustered small molecules derived from concordant signatures to easily view added compounds.

Figure 41. The Similar Compounds tab shows NSC compounds with similar chemical structures to the centroids of each cluster.

Figure 42. STITCH Network Interface shows known interactions reported in the literature.

Table 1. Evaluation and compound numbers for each iteration performed in AutoDock 4.2.6.

Table 2. Compounds were tested by DSF thermal shift assays and single high dose FP assays (100 µM compound).

Table 3. List of compounds included in SAR-like study for identification of important functional groups on lead compounds.

Table 4. Sig2Lead increases sensitivity of transcriptomic approaches for identification of FDA-approved drugs targeting EGFR.

Table 5. Sig2Lead increases sensitivity of transcriptomic approaches for identification of investigational drugs.

Chapter I: Approaches to Rational Design of Novel Small Molecule Inhibitors


Introduction

Rational drug design is the process of developing new therapeutics through a structure-informed method. First, a target must be identified as being associated with a disease state. This identification can be done through multiple approaches, but genomics and proteomics are often used as an initial step. Next, the macromolecule target must be structurally characterized in order to provide information for chemical design. Finally, an inhibitor needs to be characterized with regard to its relative inhibition of the target.

In silico modeling approaches are integral to many drug discovery processes. These approaches are useful as a means to generate hypotheses and range from homology modeling of unsolved structures (for further in silico methods), to docking simulations as a high-throughput screening technique for putative drug candidates, to pose predictions and solvent accessibility calculations using molecular dynamics simulations. In this chapter, many of these techniques are reviewed for their influence on drug discovery.

Disease Association

The first step in rational design of a drug is to identify a target protein responsible for a particular disease state. Traditionally, linkage analysis, genome-wide association studies (GWAS) and proteomic approaches have been utilized in these associations1. Additionally, some recent computational approaches have had success in predicting disease association1.

Genomics studies can be performed as a means to link various genes to disease states. GWAS is an approach commonly used to determine differences between healthy and diseased states to link a specific genomic locus to increased risk of a phenotype or disease of interest2. Using Next Generation Sequencing (NGS), expression profiles can be obtained from cells in 2 different environments, highlighting differences driven by perturbations such as shRNA knockdown or tested drugs, as performed systematically for a large number of chemical and genetic perturbations by the LINCS consortium. These types of studies are particularly useful in determination of cancer subtypes to inform treatment options3.

Profiling mRNA expression is often followed by pathway enrichment analysis to characterize genes that exhibit differential expression in response to a perturbation or disease state4. By identifying enriched pathways, one can expand the list of candidate proteins to be targeted for inhibition, selecting those pathway members with little cross-reactivity with other pathways and diseases. In other words, the function of the pathway can often be blocked more effectively at another level than by targeting the specific gene responsible for a disease.
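As a hedged illustration of the enrichment step described above, the sketch below applies a hypergeometric over-representation test, which is one common choice for pathway enrichment. The text does not state which statistic was used in this work, so the test itself, the function name and the argument names are assumptions made for illustration only.

```python
from math import comb


def hypergeom_enrichment_p(n_genome, n_pathway, n_de, n_overlap):
    """Upper-tail hypergeometric p-value, P(X >= n_overlap).

    Hypothetical helper: n_de differentially expressed genes are treated as a
    draw without replacement from n_genome genes, n_pathway of which belong
    to the pathway being tested.
    """
    upper = min(n_pathway, n_de)
    total = comb(n_genome, n_de)
    tail = sum(
        comb(n_pathway, k) * comb(n_genome - n_pathway, n_de - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Toy numbers only: 3 pathway genes in a 10-gene "genome", all 3 among 3 DE genes.
p_value = hypergeom_enrichment_p(10, 3, 3, 3)
```

A small p-value indicates that the overlap between the differentially expressed genes and the pathway is larger than expected by chance; in practice a multiple-testing correction across pathways would also be needed.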

Another approach to gene-disease association is through mutational screening. This process is commonly used in determination of novel genes through knockdowns or knockouts of the whole gene in an otherwise healthy control where phenotypes that differ from the wild type are observed and characterized. Mutational analyses can be used in conjunction with other assays such as genomics or proteomics described here and serve as controls for downstream drug testing. This type of study can also be helpful in determining reasonable drug targets as they demonstrate the viability of an organism in the absence of a particular gene. In the case of the BCL2 family, for example, knockouts of MCL1 result in embryonic lethality whereas a complete knockout of all BCL2A1 orthologs results in a very mild phenotype5–7. However, these proteins are hijacked to drive pro-survival phenotypes in various disease states8–11, illustrating the importance of a targeted therapeutic.


Proteins encoded by mRNAs may exist in many forms or not at all. In these cases, proteomics can be employed to complement transcriptional studies and address questions about protein function more directly. In eukaryotes, many proteins have post-translational modifications (PTMs) that may change the function of the protein dramatically. Even in prokaryotes, proteins are often phosphorylated as a means to send a signal to undergo some change. These changes can be detected through techniques such as mass spectrometry or high-throughput protein arrays to rapidly identify differences in protein expression under diseased conditions12.

Taken together, there is a wide range of ways to identify a gene-disease relationship. Many academic and industrial labs are working on projects that eventually lead to a gene-disease relationship, or have an interaction that they would like to inhibit to drive some other phenotype. It is with the above described tools and the work of numerous labs that these associations are determined and drug design can progress.

Structure-Informed Design

After determining a target protein, the target must be characterized in detail to rationally design a drug. This means that an accurate representation of the three-dimensional structure must be determined, typically through X-ray crystallography or Nuclear Magnetic Resonance (NMR), along with information about the mechanism of action of the protein. To disrupt an interaction interface, the interface must be known, or the researcher must have a reasonable prediction of the interface. All of these data can be modeled; however, any theoretical approach that replaces experimental data will introduce bias and decrease the overall accuracy of the design process.


X-ray crystallography is often utilized to capture snapshots of proteins or protein complexes. Crystallography requires the protein to form a crystalline lattice, which will diffract X-rays in specific patterns, dependent on the 3D arrangement of atoms in the protein. After resolving the so-called ‘phase problem’, the diffraction data can be used to create an electron density map. Once this map is generated, the crystallographer builds a model of the protein structure into the density map and the structure undergoes refinement to achieve both good stereochemistry and consistency with the observed diffraction data13. This approach is able to determine the structure of proteins and complexes with low surface entropy, but struggles with proteins that exhibit a large degree of flexibility and unstructured regions. These structures give a means to directly map protein interfaces, active sites and binding sites when a protein is crystallized in the presence of its binding partner, which is useful in rational design of novel inhibitors.

NMR is a method for determining protein structure in solution. This technique involves heavy isotope labeling of the protein for the determination of chemical shifts driven by nearby atoms within the magnetic field. To accomplish this, 15N-HSQC is performed, which uses isotope-labeled nitrogen atoms (primarily in the peptide backbone) to connect the protein backbone resonances. Next, Nuclear Overhauser Effects (NOEs) are determined for short-range interactions. By connecting these resonance signals, protein structure can be determined using methods such as CHESHIRE14. The strengths of this technique are that proteins do not need to be crystallized, which in some cases is difficult and time-consuming, and that the structure is determined in solution, allowing mapping of more flexible regions that are often not captured by crystallography. However, NMR struggles with large proteins or protein complexes (>25 kDa), which crystallography can capture well15. For these reasons, structures are routinely solved with both crystallography and NMR.

Once solved via crystallography or NMR, structures are deposited into the Protein Data Bank (PDB), from which their coordinate files can be obtained. With these coordinate files (PDB files), protein structures can be visualized and the drug design process can be initiated16. PDB files are used as the input for docking in virtual screening and as templates in homology modeling of unsolved proteins. Additionally, structural techniques are useful as definitive evidence of drug binding after initial characterization.
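To make the role of coordinate files concrete, the following is a minimal sketch of reading atom records from PDB-format text. It relies only on the fixed-column layout of ATOM/HETATM records; the function name and the returned dictionary keys are hypothetical, and a maintained parser would normally be preferred for real work.

```python
def parse_pdb_atoms(pdb_text):
    """Return one dict per ATOM/HETATM record in PDB-format text.

    Hypothetical helper for illustration. Column slices follow the
    fixed-width PDB layout (atom name 13-16, residue name 18-20, chain 22,
    residue number 23-26, x/y/z in 31-38, 39-46, 47-54; 1-based columns).
    """
    atoms = []
    for line in pdb_text.splitlines():
        if line.startswith(("ATOM", "HETATM")):
            atoms.append({
                "name": line[12:16].strip(),
                "resname": line[17:20].strip(),
                "chain": line[21],
                "resseq": int(line[22:26]),
                "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
            })
    return atoms


# Build one made-up record with the correct column widths, then parse it.
record = "ATOM  %5d %-4s %3s %1s%4d    %8.3f%8.3f%8.3f" % (
    1, " N", "MET", "A", 1, 27.340, 24.430, 2.614)
atoms = parse_pdb_atoms(record)
```

The coordinates recovered this way are what docking programs and homology modeling tools consume as input.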

When X-ray or NMR resolved structures are not available, computational modeling techniques can be used to obtain structural models of proteins for further testing. To increase the accuracy of these models, related proteins with known structures can be used to guide model building. The general approach of homology modeling is to identify structurally resolved homologs through multiple sequence alignments or through secondary structure prediction combined with fold recognition17. Once homologs are identified, the sequence of the unsolved protein is threaded through the resolved structure as a guide for the overall fold. This type of modeling is much easier when the homologs share a high degree of identity, and becomes more difficult the more divergent the proteins are with regard to their amino acid sequence. Numerous methods and servers for structural modeling are available and can be combined to generate high-confidence consensus models18–20. When of high quality, these models can serve as stand-ins for solved structures for generating hypotheses until X-ray or NMR structures become available21,22.
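Sequence identity between a target and a candidate template can be computed directly from a pairwise alignment. The sketch below is an illustration under stated assumptions: the function name is hypothetical, gaps are written as "-", and identity is defined over aligned (non-gap) columns, which is only one of several conventions in use.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over columns where neither sequence has a gap.

    Hypothetical helper; both inputs must come from the same alignment
    and therefore have equal length.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b)
             if a != gap and b != gap]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)


# Made-up aligned fragments: 4 of 5 non-gap columns are identical.
identity = percent_identity("MKT-AL", "MRTQAL")
```

Templates with higher identity to the target generally yield more reliable threading, consistent with the point made above.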


Regardless of the way in which a structure is derived, to further characterize a protein for inhibition of an interaction site, the interaction site must be identified. To this end, a number of in vitro and in silico techniques can be utilized. First, it is possible that the interaction is observed in the structure that was initially determined. This is the most straightforward and reliable case, as the interface is already captured in the structure and interacting residues can be easily identified based on atomic coordinates, but many interactions are more transient and difficult to capture via crystallography. Predictions can be made of these sites using docking of known or predicted binding partners to the protein structure using web servers, such as ClusPro23. Additionally, interface predictions can be performed without knowledge of the structure of the binding partner by using relative solvent accessibility24. These theoretical approaches can be validated in the lab using techniques such as Förster resonance energy transfer (FRET), yeast two-hybrid screening, site-directed mutagenesis, mass spectrometry or NMR25. These techniques rely on modifying known residues either through labeling or mutation and observing the change that occurs upon binding, or lack thereof, of the small molecule.
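When a complex structure is available, identifying interface residues from atomic coordinates reduces to a distance search between the two partners. The following is a minimal sketch under stated assumptions: the function name, the input layout and the 4.0 Å cutoff are illustrative choices and are not taken from this work.

```python
from math import dist


def interface_residues(chain_a, chain_b, cutoff=4.0):
    """Residues with any atom within `cutoff` of an atom in the partner chain.

    Hypothetical helper. Each chain is an iterable of
    (residue_id, (x, y, z)) entries, one per atom. The cutoff in angstroms
    is an assumed, commonly used heavy-atom contact distance.
    """
    contacts_a, contacts_b = set(), set()
    for res_a, xyz_a in chain_a:
        for res_b, xyz_b in chain_b:
            if dist(xyz_a, xyz_b) <= cutoff:
                contacts_a.add(res_a)
                contacts_b.add(res_b)
    return contacts_a, contacts_b


# Made-up coordinates: only A10 and B5 lie within the cutoff of each other.
protein = [("A10", (0.0, 0.0, 0.0)), ("A11", (10.0, 0.0, 0.0))]
peptide = [("B5", (3.0, 0.0, 0.0)), ("B6", (20.0, 0.0, 0.0))]
site_a, site_b = interface_residues(protein, peptide)
```

The all-against-all loop is quadratic in the number of atoms, which is acceptable for a single complex; spatial indexing would be preferable for larger systems.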

Virtual Screening for Drug Candidates

Once interfaces are known and suitable pockets or other surface features are identified, screens of small molecules can be performed, targeted to the specific site of interest. Virtual screening can be performed on several scales – these experiments can be performed to limit large libraries to more manageable numbers, or to explore the likely interactions between a protein target and compounds known to interact. The application requiring these screens dictates the scale on which they need to be performed.


Rigid-Body Docking

Rigid-body docking is typically used for higher-throughput screens as a means to identify compounds that likely interact with a protein target at a predicted binding site, due to the relatively low-complexity calculations being performed. Running these simulations requires a solved structure or a high-confidence structural model and a library of three-dimensional structures of target compounds.

AutoDock is a widely used state-of-the-art software package for both rigid-body and flexible docking simulations of small molecules binding to a target protein. Rigid-body docking is typically used as an efficient first step, especially if the bound conformation of the complex can be assumed to be well approximated by the resolved protein structure used in simulations. This method relies on converting the targeted protein site to a three-dimensional grid. At each node of the grid, forces felt by different types of ligand atoms due to interactions with the protein are precomputed, together with the resulting contributions to a semiempirical free energy force field used by AutoDock. These energies are stored, allowing rapid computation of the approximate semiempirical free energy of binding for each ligand pose (and thus complex conformation) in the search for energy minima, which proceeds by sampling different ligand poses using a genetic algorithm. The free energy of binding is defined as the energy of the bound complex minus the energies of the unbound protein and ligand, plus a conformational entropy term, as described by:

Equation 1:

$$\Delta G = \left(V_{bound}^{L-L} - V_{unbound}^{L-L}\right) + \left(V_{bound}^{P-P} - V_{unbound}^{P-P}\right) + \left(V_{bound}^{P-L} - V_{unbound}^{P-L} + \Delta S_{conf}\right)$$

Here, V is the pairwise energy evaluated in each given state (bound or unbound), L is the ligand, P is the protein and $\Delta S_{conf}$ is the term for loss of torsional entropy upon binding in a protein-ligand interaction ($V^{P-P}$, for example, is the intramolecular energy of the protein). This general equation consists of computed pairwise energies that include terms for dispersion/repulsion, hydrogen bonding, electrostatics and desolvation26.
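Equation 1 can be evaluated term by term once the six pairwise energies and the entropy term are known. The sketch below is a direct transcription of the equation; the function and argument names are hypothetical, and the numerical values are invented solely to show the arithmetic (units assumed to be kcal/mol).

```python
def binding_free_energy(v_ll_bound, v_ll_unbound,
                        v_pp_bound, v_pp_unbound,
                        v_pl_bound, v_pl_unbound,
                        ds_conf):
    """Evaluate Equation 1 from its ligand, protein and protein-ligand terms."""
    ligand_term = v_ll_bound - v_ll_unbound
    protein_term = v_pp_bound - v_pp_unbound
    complex_term = v_pl_bound - v_pl_unbound + ds_conf
    return ligand_term + protein_term + complex_term


# Invented values. With a rigid receptor the protein intramolecular energy
# is the same in both states, so the P-P term cancels.
delta_g = binding_free_energy(
    v_ll_bound=-1.0, v_ll_unbound=-0.5,
    v_pp_bound=-100.0, v_pp_unbound=-100.0,
    v_pl_bound=-8.0, v_pl_unbound=0.0,
    ds_conf=1.2,
)
```

In this toy case the result is about -7.3: the favorable protein-ligand term (-8.0) is partly offset by the entropic penalty (+1.2), with a small contribution (-0.5) from the change in ligand internal energy.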

For each compound, AutoDock applies a Lamarckian genetic algorithm for a number of evaluations specified by the user. These evaluations are the number of cycles each compound is allowed to undergo. In each cycle, a compound is allowed to change orientation, conformation and translation in the vicinity of its starting point. Top performing poses will be allowed more “offspring” which will inherit the “genes,” or features, of the “parent” pose and undergo some “crossover” with other top performing poses. This crossover event will allow random distribution of the “genes” in the next generation, possibly finding a better fit. Each generation undergoes the same treatment until either the energy minimum is found, or the specified number of evaluations has been reached16. The inheritance function is described by:

Equation 2:

$$n_0 = \frac{f_w - f_i}{f_w - \langle f \rangle}, \qquad f_w \neq \langle f \rangle$$

where n0 is the integer number of “offspring” (rounded down to the nearest integer), fw is the fitness of the worst individual (the pose with the highest ΔG), fi is the fitness of the individual pose and ⟨f⟩ is the mean fitness27. These fitness values are described by the predicted free energy from the force field. If the predicted ΔG of a pose satisfies fi ≤ ⟨f⟩, then n0 will always result in at least 1 offspring. If at any point fw = ⟨f⟩, the docking has reached convergence and the algorithm stops, rather than continuing to the specified number of evaluations. Additionally, this process of evaluation of free energy is repeated a number of times, as specified by the user, as independent runs. These independent runs can then be assessed for the consistency of pose prediction.
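The offspring allocation of Equation 2 can be sketched in a few lines. This is an illustration of the formula only, not AutoDock's implementation; the function name and the use of `None` to signal convergence are assumptions.

```python
from math import floor


def offspring_counts(fitness):
    """Apply Equation 2 to a population of predicted free energies.

    Hypothetical helper. `fitness` holds one predicted ΔG per pose, where
    lower (more negative) is better. Returns None when the worst fitness
    equals the mean fitness, which is the convergence condition.
    """
    f_worst = max(fitness)                 # highest ΔG is the worst pose
    f_mean = sum(fitness) / len(fitness)
    if f_worst == f_mean:
        return None                        # population has converged
    return [floor((f_worst - f_i) / (f_worst - f_mean)) for f_i in fitness]


# Invented energies: worst = -2.0, mean = -5.0.
counts = offspring_counts([-8.0, -6.0, -4.0, -2.0])
```

For these values the counts are 2, 1, 0 and 0: the two poses at or below the mean energy receive at least one offspring each, and the worst pose receives none.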


AutoDock can be parallelized across a computational cluster for use in high-throughput virtual screening. This approach has been previously utilized by the Meller group using an iterative search, which starts with a large library of compounds and a low search depth (evaluation number) and progressively increases the search depth while removing compounds with poor binding as predicted by the semiempirical force field utilized by AutoDock4. The iterations performed in studies described in this dissertation are described in Table 1. This is a means to rapidly process tens of thousands of compounds or more, enriching for those that are likely able to interact with the target protein in a specific manner. Once this step is performed, poses can be assessed by converting docking log files into standard PDB files by clustering identical poses into the same state28.

Table 1. Evaluation and compound numbers for each iteration performed in AutoDock 4.2.6. Each performed iteration reduces the overall number of compounds while increasing the search depth for the AutoDock algorithm. Iterations were performed on the CCHMC computing cluster, allowing for parallelization across hundreds of nodes.

Number of Compounds    Genetic Algorithm Evaluation Number
90,086                 250,000
30,000                 1,000,000
3,000                  10,000,000

In addition to standard docking approaches, rigid body docking can be used to predict binding of protein-reactive compounds to serine, threonine and cysteine residues. This approach is similar to the docking described above, but extends to compounds that would normally form a covalent bond with nucleophilic amino acid side chains. For these reactive compounds, it is important that the interaction is specific, as covalent inhibitors can be reversible or irreversible, but electrophiles can react with many nucleophiles. Methods aimed at covalent inhibition rely on compound libraries including electrophiles attached to various chemical scaffolds in order to generate specificity of binding. These methods also rely on additional constraints due to the need for specific interatomic distances and bond angles29.

Flexible Docking

Rigid-body docking is a powerful tool that enables high-throughput screening of large numbers of compounds in a rapid fashion. The limitation, however, is that this type of docking analysis only takes into account the crystal structure, which is a snapshot of the protein in a rigid conformation and does not account for potential flexibility of the protein. To address this, flexible docking analyses can be performed, which allow rotatable bonds and flexibility of the receptor protein in addition to the small molecule interacting partners, determined through an ensemble of several snapshots of the protein. These protein states can be determined through multiple crystal structures in various bound and unbound states, or through molecular dynamics simulations. These snapshots can then be artificially “morphed” into one another to observe intermediate states for docking of small molecules30. This process produces several conformations of the protein, allowing detection of intermediates that could interact with a small molecule even when the crystallized forms could not due to steric clashes. Docking of each of these conformations is referred to as an ensemble approach and can be performed as either a cross-docking or a mean-field approach31. The cross-docking approach refers to docking each conformation separately and scoring the compounds independently. This approach increases the overall time, which could be problematic if the library size or number of intermediates is large, but allows detection of compounds that are predicted to bind to a specific conformation. The mean-field approach docks against all conformations simultaneously by averaging the force field across them. Additional methods for handling flexible docking include soft interfaces, which allow steric clashes32, and hinge motion modeling33.


Interpreting Results of Virtual Screens

The results of these virtual screens are not necessarily trivial to interpret. First, as docking approaches rely on predictions of free energy of binding and exploring the overall energy landscape of a binding pocket, it is possible for the algorithm to get stuck in a local minimum.

This issue can, to a large degree, be avoided by running several independent screens, producing multiple different poses. This approach can also be used to provide measures of consistency in predicted poses to aid in the selection of candidate molecules for further testing.

To this end, a threshold of Shannon Entropy of Clustering can be employed34. This measures the level of overall convergence of clustering through

Equation 3: $H(X) = -\sum_i P(x_i) \log_2 P(x_i)$

where H is (unnormalized) entropy, and P is a probability function35,36. In the case of docking small molecules, this equation describes the probability of a given pose in the overall number of poses. In the virtual screening protocol described above in which fifty independent screens were run for which poses were identified, a diverse set of poses would have a high value (close to 1 for normalized entropy) of entropy and a single identified pose would have a Shannon Entropy score of 0. By using a (normalized) Shannon Entropy threshold of 0.5, the likely non-specific interactions can be screened out, while those that are predicted to bind in a specific fashion are maintained for inclusion.
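The entropy filter can be sketched in a few lines of Python. The text does not state how the entropy is normalized; the sketch assumes division by log2 of the number of independent runs, which is the entropy obtained when every run yields a different pose cluster. The function name and the cluster labels are hypothetical.

```python
import math
from collections import Counter

def clustering_entropy(cluster_labels, normalize=True):
    """Shannon entropy of pose clusters across independent docking runs.

    `cluster_labels` holds one cluster identifier per run. Normalizing by
    log2(number of runs) is an assumption made for this sketch.
    """
    total = len(cluster_labels)
    probabilities = [count / total for count in Counter(cluster_labels).values()]
    entropy = sum(-p * math.log2(p) for p in probabilities)
    if not normalize:
        return entropy
    max_entropy = math.log2(total)
    return entropy / max_entropy if max_entropy > 0 else 0.0

# Hypothetical outcomes of fifty independent runs
converged = ["pose_1"] * 45 + ["pose_2"] * 5
scattered = [f"pose_{k}" for k in range(50)]

for name, labels in [("converged", converged), ("scattered", scattered)]:
    h = clustering_entropy(labels)
    print(name, round(h, 3), "keep" if h <= 0.5 else "discard")
```

Under this normalization, a compound whose fifty runs fall mostly into one cluster scores well below the 0.5 threshold, whereas fifty distinct poses score approximately 1.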

In addition to Shannon Entropy reduction, predicted inhibition and pose assessment are vital to interpreting results from virtual screening. The output of AutoDock is a docking log file (DLG) which stores the pose, predicted inhibition constant and cluster of each pose and can be deconvoluted into a clean display using the protein-ligand docking models option in the Polyview-MM server37. Once loaded into this server, DLG files and their target receptor can be converted to PDB files for ease of viewing in software such as PyMOL, and the small molecule-interacting residues are mapped to a contact map. These results can give insight into residues in the target protein involved in likely interactions with small molecules identified through virtual screening. Additionally, the predicted inhibition constants (Ki) provided from docking serve as a means to generate a rough order of effectiveness of inhibitors as a way to aid in prioritization.
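As a rough illustration of how a predicted binding free energy maps to an inhibition constant, the sketch below uses the standard thermodynamic relation ΔG = RT ln(Ki). The temperature of 298.15 K, the compound names and the energies are assumptions chosen for illustration; they are not values from this study.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal/(mol*K)

def predicted_ki(delta_g_kcal, temperature_k=298.15):
    """Inhibition constant (molar) implied by a binding free energy.

    Uses delta_G = R * T * ln(Ki); the default temperature is an assumption.
    """
    return math.exp(delta_g_kcal / (R_KCAL * temperature_k))

# Hypothetical predicted binding energies in kcal/mol
energies = {"cmpd_A": -8.0, "cmpd_B": -6.5, "cmpd_C": -9.2}

# Rank from lowest (most potent) to highest predicted Ki
ranked = sorted(energies, key=lambda name: predicted_ki(energies[name]))
print(ranked)  # ['cmpd_C', 'cmpd_A', 'cmpd_B']
```

Because Ki depends exponentially on ΔG, a difference of about 1.4 kcal/mol corresponds to roughly a tenfold change in predicted Ki at room temperature, which is why these values are best treated as a coarse ordering.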

Once candidate compounds have been tested in vitro, or as a means to provide further evidence of specific binding, a more thorough method of pose assessment involves interrogating analogs through structure-activity relation (SAR) analysis. To this end, close relatives of the lead compound should be docked independently and observed for convergence on a single pose. If multiple close analogs are predicted to bind in a similar fashion, it is much more promising that a pose can be predicted accurately. Additionally, with these more accurate pose predictions, a researcher can design modifications of compounds that increase or decrease the likelihood of binding, providing further evidence of accurate pose prediction. This type of prediction allows researchers to design experiments to validate binding in the absence of a solved structure with the compounds.

Genomics Screening

Genomics approaches can be utilized to identify compounds that are likely inhibitors of a target pathway. This type of screen is not reliant on the structure of a target protein, but instead on expression data under various perturbations. Generally, this type of approach is utilized in an effort to identify drug targets, taking known drugs and mapping their genomic echo (the transcriptomic effects of adding the drug) to see if it can reverse the effects of a disease, thus identifying a mechanism of action38. This additionally allows for detection of off-target effects that were unknown until a more global phenotype could be mapped, as with profiling of mRNA transcripts through mRNA-seq or microarray data.

Recently, the Library of Integrated Network-Based Cellular Signatures (LINCS) consortium has collected hundreds of thousands of genomics profiles of genetic knockdown and chemical perturbation using the L1000 assay as the transcriptional readout39, which have been connected to one another by iLINCS, a database for the integration of LINCS data40,41. The L1000 assay is a genomic profiling approach that characterizes the expression of approximately 1000 genes under various perturbations, such as shRNA knockdown or added small molecules, to enable fast profiling of a large number of perturbations in multiple cell lines39. The concept of the L1000 assay was to identify “landmark” genes based on greatest variance across multiple microarray experiments, such that the included genes are the most informative, but also sample a wide range of protein families39. This approach is significantly cheaper than typical mRNA-seq experiments, costing about two dollars per experiment in reagents. With this cost reduction, it became possible to perform these experiments on a large number of perturbations, including gene knockdowns and small molecule perturbagen treatment across eleven cell lines39. Using iLINCS, connections between these L1000 profiles are generated, measuring concordance between vectors of expression values of the thousand genes collected in the L1000 on a -1 to 1 scale, corresponding to perfect anti- and positive-correlation, respectively. Chemical perturbagens (small drug-like molecules) that drive a similar transcriptional response (signature) to a knockdown of a given gene (positive concordance value) are likely pathway inhibitors for the targeted pathway, thus allowing identification of novel lead compounds or repurposing of existing drugs to novel targets via high-throughput genomic screening39–41.
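A concordance score on a -1 to 1 scale can be illustrated with a Pearson correlation between two signature vectors. Treating concordance as a Pearson correlation is an assumption of this sketch; the exact computation used by iLINCS may differ, and the short vectors below are hypothetical stand-ins for signatures of about a thousand genes.

```python
import math

def concordance(signature_a, signature_b):
    """Pearson correlation between two expression signatures (-1 to 1)."""
    n = len(signature_a)
    mean_a = sum(signature_a) / n
    mean_b = sum(signature_b) / n
    covariance = sum((a - mean_a) * (b - mean_b)
                     for a, b in zip(signature_a, signature_b))
    var_a = sum((a - mean_a) ** 2 for a in signature_a)
    var_b = sum((b - mean_b) ** 2 for b in signature_b)
    if var_a == 0 or var_b == 0:
        raise ValueError("a constant signature has no defined correlation")
    return covariance / math.sqrt(var_a * var_b)

# Hypothetical differential-expression values for five genes
knockdown = [1.2, -0.8, 0.3, -1.5, 0.9]
compound_like = [1.0, -0.6, 0.4, -1.2, 0.7]         # mimics the knockdown
compound_opposite = [-1.1, 0.9, -0.2, 1.4, -0.8]    # reverses the knockdown

print(round(concordance(knockdown, compound_like), 2))
print(round(concordance(knockdown, compound_opposite), 2))
```

A compound whose signature tracks the knockdown signature scores close to 1 and would be flagged as a candidate pathway inhibitor, while one that reverses it scores close to -1.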


Tanimoto Coefficient for Comparing Structurally Similar Compounds

Regardless of the means of obtaining putative lead compounds initially, similar compounds can be identified through structural similarity measurements. The Tanimoto coefficient between two compounds is widely used as a measure of chemical similarity42. This value is calculated by converting compounds into a fingerprint representation of chemical moieties and measuring the overlap between them. Fingerprints are a means of converting a chemical structure into a 1D profile, in this case a 1024-bit binary vector which maps substructures (subgraphs) within each compound. Software packages, such as OpenBabel, can perform these conversions rapidly for ease of comparisons and include multiple libraries of fingerprints. It is important that any compounds tested for similarity have been converted to fingerprints using the same library to avoid any incorrect comparisons. Once all compounds have been converted, the Tanimoto coefficient can be determined between each pair by using a Jaccard Index (Equation 4)42:

Equation 4: $S_{A,B} = \dfrac{\sum_{j=1}^{n} x_{jA}\, x_{jB}}{\sum_{j=1}^{n} (x_{jA})^2 + \sum_{j=1}^{n} (x_{jB})^2 - \sum_{j=1}^{n} x_{jA}\, x_{jB}}$

Or simply,

Equation 5: $S_{A,B} = \dfrac{|A \cap B|}{|A \cup B|}$

where SA,B is the similarity between the fingerprints of compounds A and B, xjA is the j-th bit of the fingerprint of compound A, and n is the fingerprint length. For binary fingerprints, Equation 4 is equivalent to the simplified expression shown in Equation 5, in which A and B denote the sets of bits switched on in each fingerprint. Because the shared bits can never outnumber the total bits set, the Tanimoto similarity measure is reported on a 0 to 1 scale, with identical fingerprints scoring 1 and fingerprints with no bits in common scoring 0. The Tanimoto similarity scores can be converted to Tanimoto distance by simply subtracting the similarity score from 1.
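The equivalence of Equations 4 and 5 for binary fingerprints can be checked with a small Python sketch. The 8-bit fingerprints are hypothetical and far shorter than the 1024-bit vectors described above; the function names are likewise illustrative.

```python
def tanimoto_vector(x_a, x_b):
    """Equation 4: Tanimoto similarity from two equal-length bit vectors."""
    shared = sum(a * b for a, b in zip(x_a, x_b))
    return shared / (sum(a * a for a in x_a) + sum(b * b for b in x_b) - shared)

def tanimoto_sets(bits_a, bits_b):
    """Equation 5: size of the intersection over size of the union of on-bits."""
    return len(bits_a & bits_b) / len(bits_a | bits_b)

# Hypothetical 8-bit fingerprints
fp_a = [1, 1, 1, 1, 0, 0, 0, 0]
fp_b = [0, 0, 1, 1, 1, 1, 0, 0]

on_a = {j for j, bit in enumerate(fp_a) if bit}
on_b = {j for j, bit in enumerate(fp_b) if bit}

similarity = tanimoto_vector(fp_a, fp_b)
print(similarity, tanimoto_sets(on_a, on_b))   # both equal 2/6
print("Tanimoto distance:", 1 - similarity)
```

Two bits are shared and six are set in at least one fingerprint, so both forms give 1/3. Both functions are undefined for a pair of all-zero fingerprints, a case that would need separate handling.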


Tanimoto similarity is a measure often used in the identification of structural analogs of compounds known to produce the desired effect. As structure dictates function, compounds with similar structures often produce similar functions. By using high Tanimoto similarity scores (≥0.8), it is possible to identify compounds that maintain very similar scaffolds, but have minor changes in functional groups. These slightly different compounds can then be tested in previous assays to generate a structure activity relationship (SAR) study that identifies functional groups that can increase or decrease the effectiveness of an inhibitor. SAR is a way to fine-tune binding to create more potent inhibitors after lead compounds are identified. Additionally, SAR studies provide additional insight as to the mode of binding a given compound exhibits, allowing researchers to better predict binding poses from virtual screening. Additional SAR studies are performed once a chemical architecture is solidified, changing out individual functional groups for others that cause minute differences in chemical properties which may further increase the potency and specificity of a small molecule.

Lower Tanimoto similarity scores (~0.6) can be utilized for initial screening of compounds. Compounds with a generally similar structure often come out in high-throughput and virtual screens, cases in which testing all involved compounds is not practical. To address this, compounds can be clustered by chemical similarity, including only a few representatives of each cluster for initial experimental screening and validation. When representatives show activity, additional members of their clusters can be tested to determine the groups or core architecture that results in the most potent compounds. This methodology is employed in Chapters II and III of this dissertation to limit the overall search space, allowing for fewer overall compounds exploring a diverse set of chemical moieties to be tested.
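The clustering and representative selection can be sketched as follows. The sketch assumes single-linkage clustering, for which cutting the dendrogram at a given height is the same as grouping compounds connected by chains of pairwise distances at or below that height; the linkage used in this work is not stated here, and the distance matrix and Ki values are hypothetical.

```python
def cluster_by_cutoff(distance, cutoff):
    """Single-linkage clusters from a symmetric Tanimoto distance matrix."""
    n = len(distance)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the cluster root, compressing the path
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if distance[i][j] <= cutoff:
                parent[find(i)] = find(j)   # merge the two clusters

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())

def pick_representatives(clusters, predicted_ki):
    """One compound per cluster: the member with the lowest predicted Ki."""
    return [min(members, key=lambda i: predicted_ki[i]) for members in clusters]

# Hypothetical Tanimoto distances (1 - similarity) for four compounds
distance = [
    [0.00, 0.20, 0.90, 0.80],
    [0.20, 0.00, 0.85, 0.90],
    [0.90, 0.85, 0.00, 0.30],
    [0.80, 0.90, 0.30, 0.00],
]
predicted_ki = [5e-6, 2e-6, 1e-6, 3e-6]   # hypothetical values, molar

clusters = cluster_by_cutoff(distance, cutoff=0.4)   # similarity >= 0.6
print(clusters)                                      # [[0, 1], [2, 3]]
print(pick_representatives(clusters, predicted_ki))  # [1, 2]
```

A distance cutoff of 0.4 corresponds to the similarity of about 0.6 mentioned above. Complete or average linkage would generally give tighter clusters than this single-linkage sketch.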


Due to the number of compounds compared in this type of analysis, it is generally easier to visualize these comparisons as a distance matrix displayed as a heatmap. These heatmaps are simply grids of values represented by colors so that patterns are easier to see. The colors can be on a variety of scales, but generally there are two colors, one representing each extreme value, with a gradient mixture of the two for intermediate values. These heatmaps are generally associated with dendrograms, which use pairwise distances between compounds to obtain a hierarchical clustering that reflects similarity between compounds, along with labels of compound identity. These representations are useful for identifying large groups of related compounds, but can be difficult to interpret when a large number of small clusters is observed.

To address this shortcoming, patterns in data can be further visualized and simplified through multidimensional scaling (MDS). MDS is a dimensionality reduction technique that can be used to generate a projection of high dimensional data into fewer dimensions (here 2D) while approximately preserving distances between data points (here Tanimoto distances generated).

Using a distance matrix, MDS plots display all data points as a scatterplot43,44 where each compound is placed so that its position in a Euclidean plane is representative of the distance from other compounds. The axes for MDS are unitless, as the overall scale is relative to the distances, and serve only as a reference for easy visualization of a complex problem. Since there is no unique solution to MDS due to having fewer solvable equations than tested variables, it is strictly for reference and displaying high-dimensional data. To reduce the error in MDS plots, and make presentation more clear, clusters have been reduced to their centroids, or representative compounds, with distances from other centroids plotted in the MDS plot. To incorporate the compounds excluded during MDS, clusters are plotted as pie charts with radii corresponding to the size of the cluster and centered on the coordinates of each centroid in MDS. These pie charts can be utilized to display information related to the cluster, such as how many within a cluster were later added to the analysis due to prior knowledge of the compound. In Chapter III, for instance, drugs that are known inhibitors of a target protein are included to identify structurally related compounds identified through the Sig2Lead application. Throughout this dissertation, data will be displayed with either an MDS plot or a heatmap, but both display the same information. Figure 1 shows a side-by-side representation of a heatmap and an MDS plot displaying the same hierarchical clustering data.


Figure 1. Heatmap (A) and MDS plot (B) illustrate the same clustering information in different ways. Heatmaps are used here to display the distance matrix of each compound compared to one another with the axes being dendrograms of compounds. Clusters can be obtained by cutting dendrograms at a specified height, or can be optimized based on the number of clusters desired. The length of branches in the dendrograms represents the distance from other branches. As an alternative visualization of patterns observed in data, MDS plots can be generated, displaying either all compounds in relation to one another or, as displayed above, representatives of each cluster in relation to each other by minimizing the error in their distances relative to the distance matrix displayed in the heatmap. The displayed MDS plot further has pie charts with radii corresponding to cluster size and is color coded based on the origin of the determined compounds (in this case known mTOR inhibitors with Sig2Lead identified inhibitors, discussed in Chapter III).


Fragment Based Screening

Screening of fragment libraries is a slightly different approach to design of novel inhibitors. Rather than starting with drug-like compounds, researchers screen libraries of “fragments” as a means to explore a larger chemical space. That is, very small molecules that are roughly half the size of a typical drug are screened for their effectiveness at binding a target in a site of interest, usually with poor binding affinity. Generally, X-ray crystallography or NMR are utilized to determine which fragments in a fragment library are binding, and to determine their mode of binding.

After identifying multiple of these binding fragments, the moieties are linked and optimized through medicinal chemistry to develop a full size inhibitor with high binding affinity45. This type of screening can lead to difficulties based on linker selection, as each bond has a fixed length and angle that can be accommodated, and incorrect stereochemistry can completely abolish binding of a given fragment46,47. Alternatively, fragments are “grown” into other binding grooves by gradually adding to the fragment through medicinal chemistry, building compounds with greater affinity. This growing technique sometimes uses multiple fragments in a similar fashion to linking fragments, but generally starts from a single point and builds into the second, as opposed to directly tethering the fragments48.

Unfortunately, traditional methods of fragment-based screening rely on large amounts of protein along with well-behaved fragments, and omitting fragments with poor solubility results in dramatic reduction in the fragment library and omission of various chemical moieties that may be important in binding of a specific target. To address these concerns, docking studies as described under the “Virtual Screening for Drug Candidates” section can be performed on fragment libraries. Using virtual screening approaches, fragments can be tested regardless of solubility and less time and protein are required. Often, virtual screens for fragments can further help to develop linkers for the fragments with pre-defined constraints, although testing these fragments is still necessary49.

Regardless of the assay for characterization of the interaction or the screening method, fragment-based screening is an approach for exploring the vastness of chemical space and identifying components with an increased affinity to the target macromolecule. These approaches allow researchers to explore functional groups of fragments to determine groups that drive binding. These groups can be connected or built from to develop highly specific and sensitive inhibitors.

Peptide Mimetics

Many current inhibitors for protein interfaces are peptide mimetics, derived either from alanine screening of native ligand structure or through high-throughput screening of peptide libraries. This type of inhibitor is designed to mimic peptides that interact in order to form a high-affinity interaction to displace the normal ligand. To accomplish this goal, pharmacophores are identified with similarity to an amino acid side chain and connected via non-peptide bonds.

The lack of peptide bonds allows for a longer pharmacological half-life, producing drugs that are more bioavailable. Additionally, these bonds tend to reduce the overall cost of a drug as compared to a true peptide inhibitor, since the compound is longer lasting and cheaper to produce en masse50.

Numerous such mimetic drugs exist, many of which target apoptotic pathways involved in cancer. For instance, peptide mimetics of the BH3-domain of pro-apoptotic proteins for inhibition of BCL2-family proteins drive cell death by displacement of pro-survival BCL-2 proteins from their pro-apoptotic target. Other such targets of apoptosis inhibition via peptide mimetics include MDM2 and SMAC/DIABLO51. Existing mimetic drugs targeted to the apoptotic pathways drive inhibition of some BCL2-family members, but are not particularly specific, driving off-target effects such as thrombocytopenia and tumor lysis syndrome51.

Additionally, none of the existing drugs inhibit BCL2A1, a BCL2-family member that has been demonstrated to be upregulated in autoimmune diseases and cancers, with a mild phenotype in knockout mice10,52,53.


Chapter II: Rational Design of BCL2A1 Inhibitors

Alexander W. Thorman, Sarah A. Hummel, Ian C. Brett, William L. Seibel, David A. Hildeman, Jarek Meller, and Andrew B. Herr


Abstract

BCL2A1 (A1) has been implicated in a wide array of diseases, ranging from autoimmunity resulting in pre-term birth to chemotherapeutic resistance in cancer. To date, no inhibitors specific to A1 have been identified and most that target the BCL2 protein family are unable to effectively block A1 activity. To address this, a virtual screening approach was employed to enrich for lead compounds to be tested in vitro that would be targeted to two pockets of A1 within the BH3-binding groove. This multiple target strategy was developed to rationally design a drug that would have high sensitivity and specificity to A1 without cross-reactivity with other family members. In this study, an initial library of 90,086 compounds was screened in silico, resulting in 148 compounds tested with purified protein. Of these, a number of compounds have demonstrated inhibition of binding between A1 and associated BH3-domain peptides in a biochemical assay. Some of these compounds have shown cooperative inhibition of BH3 binding when added in combination to A1. Additional compounds have been identified from an orthogonal genomics screen, many of which are structural analogs of the hits from the original virtual screening. Overall, we have identified approximately 50 candidate lead compounds that can serve as building blocks for the design of a high-affinity, specific inhibitor of A1.

Introduction

Members of the BCL2 family of proteins are related through homology of their BCL2 homology (BH) domains5. These proteins are subdivided into three main classes: 1) anti-apoptotic or pro-survival factors, 2) pro-apoptotic activators, and 3) pro-apoptotic sensitizers9,54.

Upon stress signaling, cells undergo apoptosis via mitochondrial outer membrane permeabilization (MOMP) which is driven by pro-apoptotic BCL2 family members. The pro-apoptotic BCL2 proteins are comprised of BH3-only proteins, such as Noxa, Puma, and Bim that serve as sensitizers, and the multidomain proteins, Bax and Bak that function as executioners. Bax and Bak are critical regulators in MOMP; they function by forming a pore in the mitochondrial outer membrane, thus releasing cytochrome c54,55. The BH3-only peptides activate Bax and Bak to allow for this permeabilization to occur and are upregulated during times of cell stress54,56,57.

Pro-survival BCL2 proteins function through either prevention of Bax/Bak activity or sequestration of BH3-only activator proteins, resulting in inactive Bax/Bak. The pro-survival BCL2 family proteins sequester the pro-apoptotic members, driving a pro-survival phenotype and blocking MOMP54,58. A number of family members have been described with varying targets and cell types in which they are present (BCL2, Bcl-XL, Bcl-w, MCL1, Bcl-B and BCL2A1)54,59. Due to the pro-survival phenotype, many of these proteins have been implicated in a wide range of cancers10.

BCL2A1 (A1 in mice, Bfl-1 in humans) shares 72% identity between mouse and human orthologs and is unique among the protein family due to its ability to be completely knocked out and not result in embryonic lethality5,6. This is likely due to its overall function as a regulator of T-cell and neutrophil maturation, maintenance of CD4+ T-cells54,60 and its absence in most other normal cellular environments. Knockout mice for all isoforms of A1 have revealed a decrease in total CD4+ T-cells and regulatory T-cells, along with conventional dendritic cells in the spleen6,7, but an otherwise mild phenotype. A1 is expressed as a target of NF-κB and is upregulated in the presence of reactive oxygen species found at sites of inflammation8,54,56. This upregulation is expected to maintain the activated state of neutrophils and T-cells at these inflammatory sites, thus prolonging the lifespan of these cells when they are needed for an immune response33,54.


A number of drugs have targeted BCL2-family proteins with varying success, most notably ABT-737, ABT-263 (navitoclax), ML214 and ABT-199 (venetoclax). ABT-737, ABT-263 and ML214 target the BCL2 family broadly10,52,53, inhibiting BCL2, BCL-xL, and BCL-W, but not specifically BCL2A1 or MCL-1. ABT-199 is a more targeted therapeutic that specifically inhibits BCL2, but not other family members61. Importantly, none of these drugs to date has been able to specifically inhibit A1 even though its gene knockout in mice shows the fewest complications; in fact, A1 is often able to compensate for an inhibited BCL2-family member when present5.

Knockdowns of A1 provide a protective phenotype against anaphylaxis in mice11. Additionally, NF-κB signaling has been shown to cause resistance to chemotherapeutic agents in an A1-dependent manner, and this machinery is often hijacked in cancers (Figure 2)8–10. A1 is also responsible for the maintenance of neutrophils at the maternal-fetal interface during pregnancy, which can drive intrauterine inflammation and pre-term birth62. Despite these phenotypes, a potent inhibitor of A1 remains elusive. To address this gap, the approach summarized in Figure 3 was used.

Both A1 and Bfl-1 have been crystallized in the presence of several known and synthesized BH3-domain proteins including Bid, Puma, Bak, Noxa, Bim and Bmf (PDBs: Bfl-1: 4ZEQ, 5UUK, 5UUL, 5UUP, 3I1H, 3MQP, 2VM6; A1: 5WHH, 5WHI, 2VOF, 2VOG, 2VOH, 2VOI)63–65, showing a conserved hydrophobic pocket (P2 pocket), consistent with other members of the family66. Additionally, the BH3-binding groove extends into a broad, shallow pocket in A1, but not other family members, potentially providing a unique surface suitable for targeting of small molecules (P4 pocket) (Figure 4). The differences in this binding groove could explain the lack of inhibition in the presence of more traditional BCL2-family inhibitors such as ABT-737 or navitoclax, resulting in compensation as described by Vogler et al.67.

Figure 2. Upregulation of BCL2A1 is observed across a wide range of cancers. A1 expression observed in cBioPortal across a range of cancer datasets from the TCGA database reveals significant upregulation in a wide range of cancer cells.


Figure 3. Overall scheme for structure informed identification of inhibitors of BCL2A1. The overall strategy (A) and the results from each stage in the method (B) are displayed above. Starting with a subset of the NCI library consisting of 90,086 small molecules, likely binding compounds were enriched for in a virtual screen and characterized for binding and inhibition in vitro through thermal shift assays and fluorescence polarization. In vitro hits were further characterized in primary splenocytes through a trypan blue cell survival assay where a specific inhibitor was identified via Bax/Bak-/- cells. As an additional validation and alternative avenue to identify more candidate inhibitors, BCL2A1 knockdown signatures from the LINCS data consortium were utilized, and all chemical perturbagens that showed strong concordance with knockdown signatures were analyzed and compared to the set identified through docking.


Methods

In silico Docking

Docking simulations were performed using a parallelized version of AutoDock (version 4.2.6)16,68,69. Docking was performed using A1 (PDB: 2VOH) bound to the Bak BH3-domain and Bfl-1 (PDB: 3MQP) bound to the NOXA BH3-only peptide. Two pockets were selected for targeting of each protein, the P2 canonical BH3-binding site (x: -6, y: 10, z: 59) (npoints x: 40, y: 34, z: 45), and the shallower P4 pocket specific to A1/Bfl-1 (x: 6, y: 14, z: 56) (npoints x: 52, y: 36, z: 50). These docking simulations were performed on the Cincinnati Children’s Hospital Medical Center (CCHMC) computational cluster over three iterations utilizing a subset of the 2007 NCI library consisting of 90,086 compounds. Each iteration decreased the overall number of compounds while increasing the depth of the search as described previously by Biesiada et al. (2012)28. The first iteration used the whole library with 250,000 evaluations. The second iteration took the top 30,000 compounds sorted by predicted Ki and used 1,000,000 evaluations, and the final iteration used 10,000,000 evaluations on the top 3,000 compounds. After each iteration for A1, an identical iteration was run on Bfl-1 and only the overlapping top compounds were carried into the following iteration. This intersection allowed for identification of compounds that would likely interact with both the human and mouse homologs and took into account slight differences that may be present from different binding partners.
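The per-iteration selection can be sketched as a ranking followed by a set intersection. The compound identifiers, Ki values and function names below are hypothetical, and how ties or unequal list sizes were handled in the actual screen is not specified here.

```python
def top_n(predicted_ki, n):
    """IDs of the n compounds with the lowest predicted Ki."""
    return set(sorted(predicted_ki, key=predicted_ki.get)[:n])

def shared_top_compounds(ki_mouse, ki_human, n):
    """Compounds ranked in the top n against both the A1 and Bfl-1 structures."""
    return top_n(ki_mouse, n) & top_n(ki_human, n)

# Hypothetical predicted Ki values (molar) from one docking iteration
ki_a1 = {"c1": 1e-6, "c2": 5e-6, "c3": 2e-5, "c4": 8e-5}
ki_bfl1 = {"c1": 3e-6, "c2": 9e-5, "c3": 4e-6, "c4": 7e-5}

print(shared_top_compounds(ki_a1, ki_bfl1, n=2))  # {'c1'}
```

Only c1 ranks in the top two for both targets, so only c1 would be carried into the next, deeper iteration in this toy example.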

After all three iterations were complete and the intersection of the top 450 compounds was performed, the remaining compounds were entropy filtered based on the convergence of their predicted poses. Compounds with clustering entropy greater than 0.5 were removed, leaving approximately 150 compounds for targeting each pocket. These compounds were clustered based on similarity of chemical moieties via the Tanimoto Coefficient using the ChemmineR package in R70. Representatives from each cluster with the best predicted Ki were selected for in vitro testing. Of the selected compounds, 79 targeted the P2 pocket and 79 targeted the P4 pocket.

Protein Expression/Purification

A1 (residues 1-152; P104K, C113S) was expressed in the BL21 strain of Escherichia coli in an H596 vector with a hexa-His-MBP tag kindly provided by Dr. Artem Evdokimov. All purification steps were performed in 20 mM Tris pH 7.0 and 500 mM NaCl. Cells were induced with 0.2 mM IPTG overnight, after which they were pelleted and lysed by sonication. After cell lysis, cell debris was pelleted and the supernatant was harvested, filtered and run through Ni-affinity chromatography. Protein-containing fractions were pooled and the tag was removed with TEV protease, rocking at room temperature overnight. The cleaved proteins were run over an additional Ni column to remove the His-MBP tag and His-tagged TEV protease. Finally, protein was run over an S75 size exclusion column and concentrated to suit assay needs.

Thermal Shift Assay

100 μM compounds were applied to purified A1 at 4.4 μM in triplicate. Sypro Orange dye was added at a final dilution of 1:1000 to protein- and compound-containing wells. An Applied Biosystems StepOnePlus was used to perform Differential Scanning Fluorimetry (DSF) by elevating the temperature from 20℃ to 99℃ and measuring fluorescence at every half degree. Melting temperature was recorded as the maximum of the first derivative, indicating half of the protein population was unfolded. Compounds that were observed to have a positive change in Tm compared to the control of greater than three standard deviations were included for future assays.
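The Tm determination and hit criterion can be sketched as follows. The melt curve is a synthetic sigmoid, the replicate Tm values are hypothetical, and taking the standard deviation from the control replicates is an assumption of this sketch rather than a documented detail of the analysis.

```python
import math
import statistics

def melting_temperature(temps, fluorescence):
    """Temperature at the maximum of dF/dT, estimated by central differences."""
    best_temp, best_slope = None, float("-inf")
    for k in range(1, len(temps) - 1):
        slope = (fluorescence[k + 1] - fluorescence[k - 1]) / (temps[k + 1] - temps[k - 1])
        if slope > best_slope:
            best_temp, best_slope = temps[k], slope
    return best_temp

def is_stabilizing_hit(compound_tms, control_tms, n_sd=3.0):
    """True when the mean Tm shift exceeds n_sd control standard deviations."""
    shift = statistics.mean(compound_tms) - statistics.mean(control_tms)
    return shift > n_sd * statistics.stdev(control_tms)

# Synthetic melt curve: readings every 0.5 degrees from 20 to 99, midpoint 55
temps = [20 + 0.5 * k for k in range(159)]
curve = [1 / (1 + math.exp(-(t - 55.0) / 2.0)) for t in temps]
print(melting_temperature(temps, curve))  # 55.0

# Hypothetical triplicate Tm values
control = [54.8, 55.0, 55.2]
print(is_stabilizing_hit([57.0, 57.2, 56.8], control))  # True
print(is_stabilizing_hit([55.3, 55.1, 55.2], control))  # False
```

Real melt curves are noisier than this sigmoid, so the raw signal is usually smoothed before differentiation; that step is omitted here for brevity.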


Fluorescence Polarization Assay

Compounds were additionally tested for specificity of binding by displacement of FITC-labeled mouse Noxa (mNoxa) peptide (Peptide2.0) via Fluorescence Polarization (FP). FP assays were performed in two steps: a single-point measurement at high compound concentration, followed by dose response measurements for the fluorescence polarization hits. A1 was added at 3 μM to 100 μM of each compound in 20 mM Tris pH 7, 500 mM NaCl, 0.005% Tween-20 buffer. After addition of 375 nM labeled mNoxa, 96-well plates were incubated overnight at 20℃ in the dark to achieve equilibrium before fluorescence polarization was measured with a Biotek Synergy H2. Autofluorescent compounds and fluorescent quenching compounds were corrected via ratiometric correction as described by Shapiro et al. (2009)71. Any compounds that showed a significant shift in polarization, along with those identified in the thermal shift assays, had a dose response measured via FP.

Dose response curves were measured by adding 3 μM A1 to a serial two-fold dilution series of each compound ranging from 400 μM to 781 nM in 20 mM Tris pH 7, 500 mM NaCl, 0.005% Tween-20 buffer. Some compounds were further tested with serial 1.33-fold dilutions to assess the accuracy of these two-fold dilutions. A1, compounds, buffer and, lastly, 375 nM FITC-labeled mNoxa were added to each well and incubated in the dark at 20℃ overnight to achieve equilibrium, followed by measurement of polarization. All dose responses were performed in triplicate.
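As an illustration of the dose response layout, the sketch below generates the two-fold dilution series (ten points from 400 μM reach 0.78125 μM, i.e. about 781 nM) and estimates an IC50 by log-linear interpolation between the two concentrations that bracket 50% bound Noxa. The interpolation is a simplification chosen for this example; the curve-fitting procedure actually used for the reported IC50 values is not specified here, and the function names are hypothetical.

```python
import math


def dilution_series(top, factor, n_points):
    """Concentrations obtained by diluting `top` serially by `factor`."""
    return [top / factor ** i for i in range(n_points)]


def ic50_by_interpolation(concs, percent_bound, threshold=50.0):
    """Concentration at which percent_bound crosses `threshold`.

    Interpolates linearly in log10(concentration) between the two bracketing
    points; returns None when the curve never crosses the threshold.
    """
    pairs = sorted(zip(concs, percent_bound))
    for (c_lo, y_lo), (c_hi, y_hi) in zip(pairs, pairs[1:]):
        crosses = (y_lo - threshold) * (y_hi - threshold) <= 0
        if crosses and y_lo != y_hi:
            frac = (threshold - y_lo) / (y_hi - y_lo)
            log_c = math.log10(c_lo) + frac * (math.log10(c_hi) - math.log10(c_lo))
            return 10 ** log_c
    return None
```

A compound whose curve never drops to 50% bound within the tested range returns None, which corresponds to the ">400" entries used for non-responding compounds.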

LINCS gene expression analysis

LINCS compounds were identified based on concordance to a Bfl-1 knockdown signature as computed through iLINCS [40,41] (see Chapter III). Only compounds within the top 0.5% of all perturbagen signatures' concordance values were included. The SMILES code of each LINCS compound was obtained and converted to FP2 fingerprints using the OpenBabel software implemented in R through the ChemmineOB package [72]. These fingerprints were compared using the Tanimoto coefficient and clustered via hierarchical clustering of Tanimoto distance [42]. After hierarchical clustering, centroids were identified from each cluster with at least four representatives, and multidimensional scaling was performed to display approximate Tanimoto distances between centroids.
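The similarity and clustering steps can be sketched in plain Python. This is not the ChemmineOB workflow: fingerprints are represented here as sets of on-bit indices, single linkage at a fixed similarity cutoff stands in for the hierarchical clustering (the linkage method is an assumption), and a medoid stands in for the cluster centroid. All names are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto coefficient between two sets of on-bit indices."""
    if not a and not b:
        return 1.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def cluster_by_cutoff(fingerprints, min_similarity):
    """Single-linkage clusters: compounds are joined when Tanimoto >= min_similarity."""
    names = list(fingerprints)
    parent = {n: n for n in names}

    def find(n):
        # Union-find lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for i, x in enumerate(names):
        for y in names[i + 1:]:
            if tanimoto(fingerprints[x], fingerprints[y]) >= min_similarity:
                parent[find(x)] = find(y)

    clusters = {}
    for n in names:
        clusters.setdefault(find(n), []).append(n)
    return sorted(clusters.values(), key=len, reverse=True)


def medoid(members, fingerprints):
    """Member with the smallest summed Tanimoto distance to its cluster."""
    return min(
        members,
        key=lambda m: sum(1.0 - tanimoto(fingerprints[m], fingerprints[o]) for o in members),
    )
```

Filtering the returned clusters to those with at least four members and taking the medoid of each would give representatives analogous to the centroids described above.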

Cell Assays

Single cell suspensions from spleen were generated by maceration through a 100 µm nylon mesh followed by Lympholyte-M ficoll gradient separation (Cedarlane Labs, Burlington, NC). Purified cells were then cultured on anti-CD3 coated (3 µg/mL, coated overnight, Biolegend, San Diego, CA) six-well plates in the presence of soluble anti-CD28 (2 µg/mL, Bio X Cell, West Lebanon, NH) and IL-2 (10 ng/mL, R&D Systems, Inc., Minneapolis, MN) in RPMI media (Life Technologies, Carlsbad, CA) for 24 hours at 37℃. Cells were then washed and cultured again in IL-2 (10 ng/mL) for 24 hours at 37℃. Cells were harvested and cultured for 24 hours on anti-CD3 coated 96-well plates at 500,000 cells per well with 2 µg/mL soluble anti-CD28, 10 ng/mL IL-2, 0.125 µg purified anti-mouse FasL (Biolegend, San Diego, CA), and varying concentrations of A1 inhibitor compounds +/- polybrene (2 µg/mL, EMD Millipore, Burlington, MA). Cells were then harvested and live and dead cells were enumerated by trypan blue staining using the TC20 automated cell counter (Bio-Rad Laboratories, Des Plaines, IL).

Results

To address the lack of A1 inhibitors, DSF thermal shifts, FP competition assays and cell death assays were used as filtering steps, resulting in a number of compounds that are capable of inhibiting A1 in vitro, and a couple that are active in vitro using primary cells (Figure 3). Each set of experiments reduced the overall number of compounds tested, while increasing the complexity of the experiment and providing more informative data characterizing the interaction, or lack thereof, of each compound. In addition to these filtering steps, lead compounds were further validated using LINCS data at the gene expression level, and synergy experiments were performed to test whether compounds could work together to increase their inhibitory capacity.

A virtual screen was performed to enrich the number of potential lead compounds to be tested in vitro. This screen consisted of three iterations of docking a subset of the NCI library to A1 and Bfl-1 in parallel. First, a drug-like subset of the NCI library consisting of 90,086 small molecules was docked to the P2 pocket of A1 (Figure 4) with 250,000 evaluations. After docking to A1, the same library was docked against the P2 pocket of Bfl-1 (Figure 4C). This pocket is conserved throughout the protein family and is a deep pocket in the center of the binding groove. Compounds targeted to this site are expected to be capable of high-affinity interactions, since the pocket is deep and lined with several interacting residues. The top 30,000 compounds by predicted inhibition constant from each run were collected, and the intersection of top hits common to the A1 and Bfl-1 searches was used as the library in the subsequent round of docking. Two more rounds of docking were performed on the P2 pocket, with greater numbers of evaluations (increased search depth) applied to the reduced compound libraries. This process was then repeated for the P4 pocket, a surface within the BH3-binding groove that is unique to BCL2A1 and differs considerably from other BCL2-family members (Figure 4D), suggesting that an inhibitor that can interact with this site would likely provide a specific interaction. After three rounds of docking for each pocket, the intersection of the top compounds left approximately 300 compounds per pocket. Compounds were further refined based on their entropy of pose clustering by removing any compounds with a Shannon entropy of clustering above 0.5 (as described in Chapter I). The remaining compounds were clustered by similarity of chemical moieties via Tanimoto coefficient, and representatives of each cluster were ordered for testing in vitro.
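The two filtering ideas in this protocol, intersecting the top-ranked compounds from the A1 and Bfl-1 runs and discarding compounds whose docked poses are spread over many clusters, can be sketched as follows. The exact entropy definition is given in Chapter I and is not reproduced here; this example assumes the Shannon entropy of pose-cluster occupancies, scaled by the logarithm of the number of poses so that it lies between 0 and 1. Function names and inputs are hypothetical.

```python
import math


def top_ranked(predicted_ki, n):
    """IDs of the n compounds with the lowest predicted inhibition constant."""
    return set(sorted(predicted_ki, key=predicted_ki.get)[:n])


def shared_hits(ki_a1, ki_bfl1, n):
    """Compounds ranked in the top n against both A1 and Bfl-1."""
    return top_ranked(ki_a1, n) & top_ranked(ki_bfl1, n)


def pose_entropy(cluster_sizes):
    """Shannon entropy of pose-cluster occupancies, scaled to [0, 1].

    Assumed normalization: divide by log(total number of poses), so a single
    dominant cluster gives 0 and one pose per cluster gives 1.
    """
    total = sum(cluster_sizes)
    if total <= 1:
        return 0.0
    h = -sum((s / total) * math.log(s / total) for s in cluster_sizes if s > 0)
    return h / math.log(total)


def passes_entropy_filter(cluster_sizes, cutoff=0.5):
    """Keep compounds whose poses converge (entropy at or below the cutoff)."""
    return pose_entropy(cluster_sizes) <= cutoff
```

Under this assumed normalization, a compound with ten poses split 8/1/1 across three clusters scores about 0.28 and is kept, while one with ten poses in ten different clusters scores 1.0 and is removed.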

Compounds that were to be tested in vitro were first tested for binding to A1 using differential scanning fluorimetry (DSF) thermal shift assays (Figure 5A) followed by an inhibition assay measuring displacement of a fluorescently tagged BH3 peptide by fluorescence polarization (FP) (Figure 5B). These assays were performed in a semi-high-throughput fashion as an initial screening step for further validation. DSF measures the midpoint temperature (Tm) for thermal denaturation of proteins; typically ligand binding stabilizes the protein fold, resulting in an increased Tm. DSF experiments resulted in 20 compounds with positive Tm shifts and another 20 with negative Tm shifts, while a number of compounds exhibited fluorescent quenching or autofluorescence, obfuscating some of the data (Table 2).


Figure 4. Boxes for virtual screening of A1 and Bfl-1 targeting two pockets within the peptide binding groove. The binding groove is highlighted by the Bak BH3-domain peptide (cyan) from A1 (white) (A) (PDB:2VOH). Grid boxes for docking simulations are targeted to both mouse BCL2A1, A1 (B) (PDB:2VOH), and Human BCL2A1, Bfl-1 (C) (PDB:3MQP). Atoms within 4.5Å of the BH3 ligand from each crystal structure are highlighted in pink. Superposition of A1 with BCL2 (red) (PDB: 5JSN) reveals significant differences within the P4 pocket (D). In this figure, the P2 pocket is displayed as the yellow grid box and the P4 pocket is displayed as the blue grid box.


Table 2. Compounds were tested by DSF thermal shift assays and single high dose FP assays (100 µM compound). Those with significant reduction in polarization through high dose were additionally tested with a dose response. Due to compound fluorescence, DSF values are often inaccurate (sometimes resulting in what appear to be dramatic Tm shifts), necessitating a secondary approach. High Dose FP was normalized using unbound Noxa as the 0% bound baseline and A1 with Noxa but no compound as the 100% value. NSC-65847 shows a negative %bound Noxa due to an average polarization value below that of Noxa alone (This however falls within the error for these assays). Compounds are ordered by their high dose FP results.

NSC ID | Average Tm Shift (℃) | High Dose FP %Bound Noxa
65847 | 1.92 | -6.62
97318 | -30.75 | 1.2
20530 | 38.82 | 4.5
79711 | 0.17 | 23.84
45195 | -1.5 | 36.11
114566 | 0.33 | 36.67
128598 | 38.83 | 37.77
687589 | 1.37 | 39.73
45576 | 14.75 | 39.74
95407 | 33.32 | 40.1
15508 | 0.49 | 41.48
666136 | -1.42 | 42.04
79588 | 11.41 | 42.72
408132 | -2.11 | 43.15
45538 | 29.16 | 43.51
9360 | -33.6 | 45.98
45600 | -0.59 | 46.69
15787 | 8.49 | 48.34
15792 | -1.43 | 51.13
65820 | 11.62 | 56.8
65823 | 1.16 | 60.6
45569 | 35.86 | 62.06
4492 | -12.92 | 62.06
154659 | 0.24 | 62.31
88839 | -0.89 | 66.56
45201 | 0.43 | 68.17
400681 | 19.25 | 70.39
7223 | 37.41 | 70.74
379866 | 0.33 | 70.86
13728 | 19.82 | 73.95
17249 | -0.41 | 75.56
16087 | -2.26 | 77.7
640559 | -0.38 | 77.83
48459 | 1.32 | 78.75
10441 | -0.09 | 79.42
620127 | 0.32 | 79.5
14755 | 0.58 | 80.1
48163 | 1.08 | 80.36
634601 | 5.57 | 80.44
611615 | -1.01 | 80.6
142516 | 0.08 | 80.94
632901 | -2.74 | 81.21
377159 | 0.34 | 81.6
346555 | -0.25 | 81.65
310326 | -0.68 | 81.84
374898 | 0.08 | 81.85
47941 | -4.18 | 82.12
691274 | -2.95 | 82.48
371880 | 0.83 | 83.07
687805 | -1.68 | 83.12
85321 | -4.08 | 83.93
47711 | -18.57 | 84.77
637605 | -0.93 | 85.25
678918 | -0.42 | 85.97
408161 | 0.22 | 86.03
39915 | 0.81 | 86.7
47932 | 38.32 | 86.87
666713 | -10.08 | 87.81
87001 | 0.32 | 88.36
89166 | -0.17 | 88.89
610938 | 0.57 | 89.21
404265 | -0.28 | 89.52
687808 | 1.25 | 90.47
304954 | -0.17 | 90.65
656732 | -8.58 | 90.76
89296 | 0.33 | 91.11
376248 | -0.17 | 91.73
30854 | -4.87 | 92.09
109111 | 0.3 | 92.24
45567 | -1.23 | 92.99
351569 | 0.82 | 93.05
78677 | 35.92 | 93.38
287404 | -0.17 | 93.88
20529 | 18.05 | 94.53
128588 | -1.18 | 95.19
49634 | -9.66 | 95.36
8797 | -1.18 | 95.57
112661 | -0.17 | 96.3
90615 | 0.32 | 96.67
109838 | -1.93 | 96.69
408133 | -0.68 | 96.76
617988 | -0.17 | 97.12
514050 | -0.09 | 97.12
114453 | -0.97 | 97.68
201951 | 0.17 | 99.34
292451 | 0.5 | 99.45
378145 | -0.23 | 99.64
65536 | -32.58 | 99.67
46540 | -2.26 | 99.67
51525 | 0.17 | 100
403165 | -0.56 | 100.14
310325 | -0.17 | 100.36
676996 | -2.43 | 100.55
15796 | -0.85 | 100.64
24953 | -0.42 | 100.74
382875 | -1.47 | 100.96
20670 | -0.17 | 101.11
647252 | -0.28 | 101.26
100781 | -1.69 | 101.32
379447 | -6.93 | 101.59
408036 | -0.92 | 101.59
21560 | -0.68 | 102.77
628412 | -1.18 | 103.33
127134 | -0.67 | 103.64
359822 | -0.66 | 103.82
400831 | 0.33 | 103.88
39914 | 0.33 | 104.07
382688 | 0.5 | 104.43
24056 | -0.7 | 104.44
6132 | -33.94 | 105.14
687794 | 0.5 | 105.25
48877 | 0.57 | 105.93
21008 | 0.58 | 106.67
359821 | -2.17 | 106.95
39912 | -0.93 | 107.07
10458 | 1.17 | 109.32
657608 | -3.44 | 109.55
645017 | 9.31 | 109.82
45203 | 0.42 | 110.29
45559 | -11.83 | 110.29
14574 | -0.6 | 111.58
45522 | -0.33 | 111.9
363997 | 0.32 | 112.75
661088 | -6.65 | 112.75
64753 | 11.08 | 113.3
118065 | 0.57 | 115.56
48460 | -0.43 | 117.04
119459 | -0.17 | 118.52
45202 | -0.34 | 120.26
119285 | -0.83 | 120.37
119143 | -0.18 | 120.37
628413 | 0.29 | 121.62
48691 | -0.16 | 134.44
114792 | -9.95 | 137.04
51261 | 0.08 | 139.26
645804 | -1.43 | 140.46

DSF provided a good basis for removing a number of compounds that showed no evidence of binding to A1; however, the number of compounds with fluorescence issues such as quenching necessitated an orthogonal approach. For this reason, all representative compounds were additionally tested for inhibition of BH3 binding through competition-based FP assays [73] (Table 2). These assays utilized fluorescently labeled Noxa peptide, which binds within the BH3-binding groove. The FP assay measures the tumbling rate of the labeled Noxa peptide; when Noxa is bound to A1, its rate of tumbling is low (and FP is high), whereas an effective inhibitor compound will displace Noxa, causing a fast rate of tumbling (and a low FP value). First, a single high-dose experiment was performed with each compound, using 100 μM of compound with A1 and FITC-Noxa, to observe the relative inhibition of the A1-Noxa interaction (Figure 5B). This resulted in 41 small molecules that deviated from the control mean by three or more standard deviations towards inhibition of binding. These compounds and any that were three standard deviations or more above the thermal shift controls, along with those that were initially derived from early compounds with reasonable IC50s, were tested in a dose response experiment. To this end, 30 putative P2 inhibitors and 44 putative P4 inhibitors had their IC50s determined. Representative dose response curves are shown for both putative P2 (Figure 6A) and P4 (Figure 6B) inhibitors. Among the dose responses, there was a surprising enrichment of putative P4 inhibitors among those that had IC50 values less than 50 μM.
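The normalization and hit-calling logic of the single-dose FP screen can be written out as a short sketch. It assumes the scaling given in the Table 2 legend (free Noxa as 0% bound, A1 with Noxa and no compound as 100% bound) and a three-standard-deviation cutoff below the control mean; the function names and the control replicates in the usage note are hypothetical.

```python
import statistics


def percent_bound(mp, mp_free, mp_bound):
    """Scale a polarization reading so free Noxa = 0% and A1 + Noxa = 100%."""
    return 100.0 * (mp - mp_free) / (mp_bound - mp_free)


def inhibition_hits(compound_mp, control_mp, n_sd=3.0):
    """Compounds whose polarization lies n_sd control SDs or more below the control mean."""
    cutoff = statistics.mean(control_mp) - n_sd * statistics.stdev(control_mp)
    return [name for name, mp in compound_mp.items() if mp <= cutoff]
```

Using the reference values reported with the checkerboard data (67 mP for Noxa alone, 155 mP for A1 with Noxa), a well reading 111 mP corresponds to 50% bound Noxa. With hypothetical control replicates of 150, 155 and 160 mP, the hit cutoff would be 140 mP.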

Compounds that had an IC50 less than 50 μM were tested for additive and/or synergistic effects with compounds predicted to interact with the other pocket (Figure 6C-D). NSC-97318 and NSC-65847 (Figure 6C) show an additive effect when the observed IC50s of the compounds together are plotted onto the isobologram of the independent compounds' IC50s, following the linear equation described by Tallarida, 2011 [74]:

Equation 6: $\frac{a}{A} + \frac{b}{B} = 1$,

where a and b are the concentrations of NSC-97318 and NSC-65847 at which 50% inhibition of Noxa binding is observed when combined, and A and B are the IC50 values of the compounds independently of one another. Other compounds tested with NSC-97318 did not show the same additivity (Figure 6D). Additionally, NSC-9360, NSC-45195 and NSC-65847 show cooperative inhibition of Noxa binding with NSC-15508 (Figure 6E). A checkerboard experiment demonstrates the overall additivity in binding of two compounds (Figure 7).
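The isobologram relation a/A + b/B = 1 can be evaluated directly for any pair of combined concentrations. The sketch below computes the left-hand side and labels the result; values near 1 lie on the line of additivity, values below 1 fall under it (synergy) and values above 1 fall over it (antagonism). The tolerance band and the function names are assumptions made for this example, not part of the analysis described above.

```python
def combination_index(a, b, ic50_a, ic50_b):
    """Left-hand side of the isobologram equation: a/A + b/B."""
    return a / ic50_a + b / ic50_b


def classify(ci, tolerance=0.1):
    """Label a combination index; the tolerance band is an arbitrary illustration."""
    if ci < 1.0 - tolerance:
        return "synergistic"
    if ci > 1.0 + tolerance:
        return "antagonistic"
    return "additive"
```

As a worked case using the single-agent IC50 values reported for NSC-97318 (27.3 μM) and NSC-65847 (19.4 μM), a hypothetical combination reaching 50% inhibition at 13.65 μM and 9.7 μM gives 0.5 + 0.5 = 1 and would be classed as additive.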


[Figure 5 plots. Panel A: Tm (℃) versus compound (ranked by IC50), with the control Tm +/- 3 SD marked. Panel B: %Bound Noxa versus compound (ranked by IC50), with fully bound Noxa +/- 3 SD marked. Legend: P2 IC50 > 100 µM; P2 IC50 < 100 µM; P2 inhibition of WT cell survival; P2 no dose response; P4 IC50 > 100 µM; P4 dose response IC50 < 100 µM; P4 no dose response.]

Figure 5. High Dose DSF (A) and FP (B) experiments were performed in parallel to further reduce the overall number of compounds to be tested. DSF and FP were performed with 100 µM of each compound for binding or Noxa binding inhibition. By DSF, all compounds that drove an increase in Tm that corresponded to at least three standard deviations from the mean were included in further assays. Additionally, any compounds that showed a decrease in polarization of three or more standard deviations from the mean and those that were derived from early compounds with measurable IC50 values were included for further dose response studies.


[Figure 6 plots. Panels A-E: %Bound Noxa versus compound concentration (µM, log scale). A: P2 compounds. B: P4 compounds. C: NSC-65847 alone and with NSC-97318. D: NSC-45195 alone and with NSC-97318. E: NSC-9360, NSC-65847 and NSC-45195, each alone and with NSC-15508.]

Figure 6. Fluorescence Polarization assays identify compounds that inhibit Noxa binding. P2 (A) and P4 (B) compounds that displayed decreased polarization or increased Tm values of at least three standard deviations from the control were tested in a dose response FP experiment and their IC50s were determined. Those with an IC50 < 50 μM were tested in synergy experiments utilizing NSC-97318 at its IC50 in combination with a titration of P4 inhibitors (C-D). Additionally, NSC-15508 was tested in combination with a titration of P4 inhibitors (E).


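Polarization (mP); rows are [97318] (µM), columns are [45195] (µM)

[97318] \ [45195] | 200 | 100 | 50 | 25 | 12.5 | 6.25 | 0
200 | 66 | 60 | 58 | 58 | 56 | 57 | 57
100 | 64 | 59 | 55 | 55 | 54 | 60 | 63
50 | 64 | 58 | 53 | 62 | 63 | 69 | 73
25 | 72 | 55 | 51 | 64 | 88 | 108 | 113
12.5 | 67 | 56 | 54 | 74 | 119 | 126 | 129
6.25 | 63 | 51 | 56 | 114 | 132 | 135 | 139
0 | 70 | 53 | 81 | 127 | 145 | 146 | 154

Noxa alone: 67
A1 + Noxa: 155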

Figure 7. Checkerboard assays show additivity between P2 and P4 inhibitors. NSC-97318 and NSC-45195 were tested for additivity across a wide range of concentrations and demonstrated a low level of additivity for inhibition of A1-Noxa binding. Polarization values (mP) are displayed upon titration of a P2 (97318) and a P4 (45195) inhibitor. Values in blue are those above the IC50 threshold and those in red are at the completely unbound state.

Compounds were additionally ordered and tested using a structure activity relationship (SAR)-type approach, focusing on compounds related to those that performed well in initial testing (Figure 9). For each compound derived from another in the above-described screens, full dose responses were tested and IC50 values were calculated (Table 3). This approach served to build upon existing lead compounds and search for related molecules with higher affinity interactions, demonstrating some key functional groups likely involved in binding. For instance, NSC-45195 (Figure 9P) and NSC-65847 (Figure 9V) are structurally quite similar, with NSC-65847 being a duplication of most of the NSC-45195 scaffold with symmetry across the di-azide bonds, yet the larger molecule drives roughly a two-fold increase in inhibition. Effective P2 inhibitors seem to require a hydrophobic group to interact with the deepest cavity of the P2 pocket, along with some polar groups to facilitate proper registry of binding. Many of the predicted binding compounds contain a sulfonate group, which is predicted to interact between the P2 and P4 pockets (Figure 12A).

As an additional screen for activity, compounds were clustered with other compounds that show similar changes in gene expression profiles to a BCL2A1 knockdown. This analysis revealed a large cluster of compounds related to NSC-97318 and NSC-65847, along with other P2 and P4 predicted inhibitors, with a Tanimoto coefficient above 0.6. The same clustering performed with a Tanimoto cutoff of 0.8, while not conserved for all compounds included in this analysis, still found clustering of NSC-97318 with a number of LINCS compounds (Figure 8). The LINCS-derived compounds included in this clustering were identified through iLINCS [40,41] by searching for a knockdown of BCL2A1 and identifying concordant chemical perturbagen signatures, which indicate likely apoptosis pathway-specific inhibitors. Clustering based on Tanimoto coefficient yielded distinct compound clusters composed of molecules with similar chemical moieties that likely target various members of the apoptotic pathway. The connectivity analysis between BCL2A1 gene knockdown and chemical perturbation transcriptional signatures, coupled with in vitro inhibition, suggests a group of structurally related compounds that could drive inhibition of A1.


Figure 8. Clustering of tested compounds with compounds inducing gene expression signatures concordant to that of a BCL2A1 knockdown identifies groups of related compounds that are putative pathway inhibitors. Compounds tested in vitro identified as having an IC50 value below 400 μM were clustered (based on Tanimoto Coefficient) with compounds identified via iLINCS as having signature concordance to a knockdown of A1. These LINCS-derived compounds are likely pathway inhibitors due to similarities in gene expression data, providing further support that structurally related compounds (Tanimoto similarity ≥0.7) would also inhibit the pathway. Within this clustering, it is expected that various clusters would inhibit different members of the apoptotic cascade, driving the same downstream effect. Clusters are displayed as pie charts centered on sites identified by multidimensional scaling of centroids of clusters with a minimum of four compounds. The radius of each pie chart corresponds to the number of compounds comprising the cluster. A few representatives from the cluster containing NSC-97318 are depicted to the left showing a generally conserved structure. Some P2 and P4 compounds are found within the same cluster at this lower Tanimoto threshold, but are resolved upon higher thresholding. This overlap may arise in part due to the overlap between the P2 and P4 pockets.


[Figure 9 structures, grouped by compound family. Family labels recoverable from the figure: 97318, 408161, 48459, 15787, 45195 and 65820.]

Figure 9. Chemical structures of compounds structurally related to initial hits grouped by compound family. Analogs to initial hits from early screens were ordered for further testing. Some of these compounds show close similarity in structure, but deviate in IC50 value, illustrating the importance of distinct functional groups.


Table 3. List of compounds included in SAR-like study for identification of important functional groups on lead compounds. Compounds identified in early screens as lead compounds were expanded upon based on structural similarity. In some cases, this led to dramatic changes in IC50 values, highlighting groups that are necessary or dispensable in drug design. The structure column corresponds to the structures displayed in Figure 9 above. Bolded compound names are initial hits with their analogs listed below. Finally, IC50 values from FP dose response curves are displayed, illustrating some major differences.

Structure | Compound | IC50 (µM)
A | 97318 | 27.3
B | 10458 | >400
C | 6132 | 387
D | 16087 | 211
E | 374898 | 13.8
F | 408161 | 397
G | 46540 | >400
H | 201951 | >400
I | 48459 | 389.5
J | 79050 | >400
K | 15787 | >400
L | 14574 | 299.6
M | 15796 | 333.6
N | 15508 | 49.1
O | 15789 | >400
P | 45195 | 31.8
Q | 65536 | >400
R | 79711 | 15.8
S | 45203 | 234.9
T | 45522 | >400
U | 78677 | 234.9
V | 65847 | 19.4
W | 45600 | 104.5
X | 10441 | 79.8
Y | 79588 | 100
Z | 45567 | >400
AA | 65820 | 68.9
AB | 9360 | 45.8
AC | 65823 | 119.6
AD | 45201 | >400
AE | 45559 | >400
AF | 7223 | 320
AG | 45538 | 59.2
AH | 45202 | >400
AI | 51525 | 55
AJ | 47711 | 189.3
AK | 45576 | 32


To provide further validation of these results, an in vitro assay utilizing primary splenocytes was used as a model system. In this system, compounds were added to wild-type primary splenocytes from mice that had been cultured and activated over four days. Lead compounds with IC50 values below 50 µM were added to the cells before observing cell death. The results suggested that the P2 compounds tested killed wild-type cells. However, this killing mostly lacked specificity, as shown by cell death in Bax/Bak-/- cells (Figure 10). At low concentrations, NSC-15508 demonstrated specific killing, which could be further optimized by growing this compound from the P2 pocket into the P4 pocket. Predicted P4 inhibitors have not demonstrated the same killing effects seen in the initial assays with the predicted P2 inhibitors, but may still be useful as part of a bridged compound (Figure 11).

Discussion

Using a combination of virtual screening and semi-high-throughput biophysical techniques, NSC-15508 was identified from a large library of readily available compounds as a lead compound capable of inducing cell death by itself at sufficiently high concentrations. Additionally, NSC-65847 was identified as an inhibitor in vitro, which is able to provide an additive effect of inhibition with NSC-97318. When comparing the most probable poses identified through docking of these two compounds, it becomes apparent that they likely cannot coexist within the BH3-binding groove in their current forms, but could likely show a more synergistic effect if they were properly bridged. These two compounds share a common sulfonate group interacting at a positively-charged patch along the BH3-binding groove (Figure 12).


[Figure 10 plot: %live cells/mL (control normalized) versus [15508] (µM, log scale) for WT and Bax/Bak-/- cells.]

Figure 10. In vitro assays utilizing activated primary splenocytes reveal cell death in the presence of P2 inhibitor NSC-15508. Mice deficient in Bax/Bak (blue) are unable to undergo intrinsic apoptosis due to the lack of the executioners in the pathway. Thus, cell death seen in WT cells (red) with P2 inhibitors that is also seen in the Bax/Bak-/- mice is suggestive of cytotoxicity as opposed to specific inhibition. The region in gray indicates a region in which this compound may drive specific cell death.


[Figure 11 plots. Panels A-E: %live cells/mL (control normalized) versus compound concentration (µM, log scale), with and without polybrene. A: NSC-65847 (WT). B: NSC-9360 (WT). C: NSC-7223 (WT). D: NSC-45538 (WT and Bax/Bak). E: NSC-45195 (WT).]

Figure 11. Predicted P4 compounds drive some cell death in WT cells at high doses. P4 compounds should be more specific to BCL2A1 and drive fewer off target effects and should enhance specificity in bridged compounds. Compounds were tested with (blue) and without polybrene (red) to attempt to cross the cell membrane.


Additional compounds were identified as demonstrating an in vitro effect and having structural similarity to compounds derived from gene expression profiles, but not driving cell death when applied on their own (Figure 11). Presumably, these compounds were unable to easily cross the cell membrane or diffuse to the target site on A1, explaining their lack of activity in the cell death assay. Despite the lack of cellular activity, these compounds will be useful for further structural studies. It is possible that compounds closely related to these lead compounds may be less problematic for crossing the cell membrane, resulting in compounds that still bind with high specificity and affinity, but function as better therapeutics. Starting with these compounds, an inhibitor that binds both the P2 and P4 pockets can be developed for specific A1 inhibition reducing cross-reactivity to other BCL2-family members and off-target effects.

These studies set the basis for development of a potent and specific inhibitor of A1. NSC-97318 and NSC-15508 both show strong inhibition of Noxa binding in FP assays with some effect on cell survival, albeit likely non-specific. From these initial scaffolds, along with a number of P4 binding compounds, it is conceivable to build a single molecule that spans the BH3-binding groove for specific and sensitive inhibition of A1. Such an inhibitor could provide a potential treatment for intrauterine inflammation. The Kallapur group has implicated BCL2A1, through mRNA expression, in driving survival of the chorio-decidua neutrophils responsible for intrauterine inflammation, a leading cause of pre-term labor [62]. Additionally, the compensation to navitoclax and ABT-737 could be overcome in a system treated with this inhibitor, such as chronic lymphocytic leukemia.


Figure 12. Proposed binding of the P2-targeted inhibitor, NSC-97318 (blue), and the P4-targeted inhibitor, NSC-65847 (orange), to A1 suggests potential for bridging to drive a specific and sensitive interaction that blocks binding of BH3-domain peptides. NSC-97318 (blue) binds within the P2 pocket (contacting residues in red) of the BH3-binding groove while NSC-65847 (orange) binds within the P4 pocket (contacting residues in blue; overlapping residues in purple), with a conserved sulfonate group as a potential handle for bridging the two compounds to generate a higher affinity and more specific inhibitor of A1 (A). A 3D backbone structural map highlighting alpha carbons of putative interacting residues (blue) along the BH3-binding groove (B) and sequence contact maps of NSC-97318 (C) and NSC-65847 (D) show a convergence of interacting residues across docking simulations upon residues that would normally make up the BH3-binding groove.

To further elucidate an inhibitor for A1, additional screening is required, including SAR studies of current compounds and exploration of novel compounds that exhibit similar expression profiles to knockdowns of A1, along with bridging these distinct chemical moieties across the BH3-binding groove. One step towards this end will be described in Chapter III. This approach will integrate chemical similarity searching with transcriptomic profiling of small molecules to identify compounds that drive similar expression profiles to an A1 knockdown while being chemically related to compounds already identified in this screen. By combining these approaches, the most effective individual molecules can be identified before beginning to optimize compound bridging.


Chapter III: Sig2Lead: Integration of Omics signatures and chemical similarity for improved structure activity relationship analysis and lead compound identification

Alexander Thorman, James Reigle, Somchai Chutipongtanate, Behrouz Shamsaei, Marcin Pilarczyk, Mehdi Fazel-Najafabadi, Rafal Adamczak, Michal Kouril, Mario Medvedovic, Jarek Meller


Abstract

Experimental drug discovery and repurposing is a costly and time-consuming process. Here, a novel approach that combines cheminformatics and Library of Integrated Network-Based Cellular Signatures (LINCS) gene expression signatures of genetic and chemical perturbations, as well as a new application to facilitate such integration, dubbed “Sig2Lead”, are introduced. This approach utilizes the largest collection of expression profiles of cellular perturbations to date to identify small molecule inhibitors that drive a similar expression pattern to knockdowns of a target pathway. From this list of compounds, structural similarity searching is performed to identify cluster representatives, enable Structure-Activity Relation (SAR) analyses and facilitate more rapid screening, while extending chemical search space.

Presented below, connectivity enhanced Structure Activity Relation (ceSAR) is a novel methodology that combines structural similarity of small molecules with transcriptional connectivity and virtual-screening predictions of specifically interacting molecules to enhance SAR and lead compound discovery. ceSAR is implemented as an R Shiny package, dubbed Sig2Lead, that employs connectivity between chemical and genetic signatures included in the LINCS library in order to identify small molecule inhibitors of targeted pathways. The application begins with a view that allows a user to input a target gene. From here, the app determines chemical compounds in LINCS/NCI that exhibit a similar transcriptomic signature (L1000 assay) to a knockdown of the target gene. Small molecules are clustered by chemical similarity and representative compounds are selected from the computed clusters. These representatives can then be used to filter drug candidates for experimental analysis. Three cases illustrate the improvements this methodology can provide: enrichment over a random library, increased sensitivity relative to traditional transcriptomics approaches and increased specificity in the context of molecular docking. Sig2Lead has been demonstrated to yield significant enrichment in true binders across a diverse set of targets through use of docking benchmark libraries. When used as an additional filter over an existing transcriptomics tool, Sig2Lead was able to identify FDA-approved and investigational drugs compiled from DrugBank for a target selected because it is included as a knockdown in LINCS and has known inhibitors (some of which are not present within LINCS), including drugs that were not identified by transcriptomics alone. Virtual screening of small molecules is demonstrated to have an additive effect when combined with Sig2Lead to improve the search for putative candidates, resulting in a further enrichment of candidate compound libraries in true positives. In conclusion, Sig2Lead is a novel computational tool that utilizes expression signatures to aid in the drug discovery process by enhancing existing approaches and providing an easy-to-use, rapid approach to further filter compound libraries, enriching for true positives.

Introduction

In silico screening of small molecule libraries for their predicted interaction and inhibition of protein targets is vital to reduce the time and cost requirements in drug discovery and repurposing projects. Traditionally, wet lab techniques were employed to test possible drug candidates, generating large amounts of data, but requiring a significant time and monetary investment and screening a large number of non-inhibitors in the process. The emergence of in silico techniques allowed one to assess possible drug candidates and reduce the number of candidate drugs to be tested by hand, dramatically decreasing the costs and man-hours of such screens [28,68].


It is well known that structure dictates function, and it has been demonstrated that this principle is applicable to small molecule inhibitors79. Virtual screening has utilized chemical similarity to enhance existing techniques by searching for similar compounds for SAR analysis. Additionally, proteomics approaches utilize chemical similarity to predict interactions with structurally related molecules80.

Experimental profiles of drug activity have been widely used in drug design, mode of action identification and SAR-type analysis39,75–78. Examples include identifying targets based on similar bioactivity profiles to known inhibitors (BASS)77, which compares small molecule similarity based on biological responses, and using the connectivity map approach to connect gene expression profiles of gene knockdowns or chemical perturbations on smaller scales (CMap)39,78.

These previous attempts to identify small molecule inhibitors have been based entirely upon the bioactivities of a given small molecule compared to that of another small molecule as opposed to comparing the bioactivity to the activity of a gene knockdown or overexpression. These types of approaches have been demonstrated to recover true positives but have not integrated chemical similarity for increasing the overall success rates or the combination of gene knockdown and chemical perturbation signatures.

The Library of Integrated Network-Based Cellular Signatures (LINCS) comprises a consortium that supports six Data and Signature Generation Centers (DSGCs) from institutes across the country, along with the BD2K-LINCS Data Integration and Coordination Center. The LINCS consortium catalogs gene expression, along with other cellular processes that occur when cells are exposed to a variety of perturbing agents. In the presented methodology, the L1000 signatures of gene knockdowns are assessed for concordance or discordance with all LINCS compounds using the iLINCS (www.ilincs.org) web service tool46,47. L1000 is a high-throughput gene expression assay that measures the mRNA transcript abundance of 978 “landmark” genes from human cells. Computational analysis of gene expression compendia by the Broad Institute’s CMap group suggested that these 978 genes capture a large fraction (~80%) of the information in the transcriptome at a fraction of the cost39,78. LINCS catalogs various cancer and normal cell lines and thousands of gene knockdowns (33,410 to date, comprising multiple shRNA consensus knockdowns and cell lines) and chemical perturbations (68,960 to date, across multiple cell lines). These data, which represent the largest attempt to date to systematically collect cellular signatures for over 40,000 small molecules and over 7,000 unique gene knockdowns46,47, are made available by the iLINCS resource and have permitted the development of a tool that relies on quick retrieval of chemical perturbagens that have a similar (positively correlated) or dissimilar (negatively correlated) cellular effect to that of a targeted gene knockdown.

An alternative and, to the best of our knowledge, novel approach introduced here utilizes gene expression signatures in combination with structural similarity to assess the cellular effect of each compound and its application as a putative inhibitor or activator. This type of strategy does not require a crystal structure of the target protein and allows rapid assessment of a large number of compounds. Specifically, the L1000 gene knockdown and compound-treated cell signatures are assessed for concordance or discordance. The concordant or discordant compounds are then compared using Tanimoto similarity and clustered accordingly.
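The concordance step described above can be sketched as follows. This is only an illustration: it assumes concordance is measured as the Pearson correlation between a knockdown signature and a compound signature over their shared genes, which the text does not state to be the statistic used by iLINCS. The function names, gene labels, expression values and the 0.2 cutoff are all hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)

def classify(knockdown, compound, cutoff=0.2):
    """Compare two signatures given as dicts of gene -> differential expression.

    Only genes present in both signatures are used. The cutoff is a
    hypothetical value chosen for illustration.
    """
    genes = sorted(set(knockdown) & set(compound))
    r = pearson([knockdown[g] for g in genes], [compound[g] for g in genes])
    if r >= cutoff:
        return "concordant", r
    if r <= -cutoff:
        return "discordant", r
    return "unrelated", r

# Hypothetical three-gene signatures (real L1000 signatures cover 978 landmark genes).
knockdown = {"GENE_A": 1.2, "GENE_B": -0.8, "GENE_C": 0.3}
mimic = {"GENE_A": 2.4, "GENE_B": -1.6, "GENE_C": 0.6}      # same direction as the knockdown
opposite = {"GENE_A": -1.2, "GENE_B": 0.8, "GENE_C": -0.3}  # reversed direction

print(classify(knockdown, mimic)[0])
print(classify(knockdown, opposite)[0])
```

A positively correlated compound is a candidate inhibitor under this scheme, since it reproduces the effect of removing the target; a negatively correlated one is a candidate activator.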

The Sig2Lead application implements ceSAR by leveraging iLINCS application program interface (API) calls to allow investigators to quickly and efficiently obtain a representative set of compounds from LINCS that display a similar cellular expression signature to a gene knockdown of interest. This tool is useful to investigators in reducing the time and cost of the drug discovery and repurposing process, allowing discovery of pathway inhibitors, which can be further characterized through cluster analysis and characterization of structurally related compounds.

Methods

The strategy involved in Sig2Lead is to allow users to select their target gene and query iLINCS for any knockdown signatures46,47. After obtaining all knockdown signatures, any signatures of small molecule perturbagens that share a significant level of concordance with a knockdown of the target gene are identified, and the union of these small molecules is concatenated into a single file containing structural data for all of these compounds. Next, these compounds are converted to three-dimensional structural data files (SDF) through the use of OpenBabel72,81. Additionally, compounds are converted into binary vectors using the FP2 fingerprint library available in OpenBabel81 in order to provide a metric for chemical distance calculations. Using these fingerprints, Tanimoto distance is calculated pairwise between all compounds such that hierarchical clustering can be performed76. These distances are mapped into a distance matrix, which is then displayed as a heatmap with dendrograms displaying the overall topology of the clusters82. Users of this app can download the total list of compounds in their SMILES representation, their SDF files and the heatmaps generated through this clustering, along with the compounds that comprise each cluster and the compound that defines the centroid of each cluster. Users can also visualize a network of the target gene and the compounds in any selected cluster with their potential interacting partners, based on the known and predicted chemical-protein interactions in the Search Tool for Interactions of Chemicals (STITCH) database83.
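The fingerprint comparison and clustering described above can be sketched in a few lines. This is a simplified illustration, not the Sig2Lead implementation: fingerprints are represented as sets of on-bit indices (in practice they would come from the OpenBabel FP2 fingerprints), clusters are obtained by cutting a single-linkage hierarchy at a distance cutoff (the linkage method is an assumption, as the text does not specify one), and the cluster representative is taken to be the medoid. All names and fingerprint values are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def distance_matrix(fingerprints):
    """Return sorted compound names and the pairwise Tanimoto distance matrix."""
    names = sorted(fingerprints)
    dist = [[1.0 - tanimoto(fingerprints[p], fingerprints[q]) for q in names]
            for p in names]
    return names, dist

def single_linkage_clusters(dist, max_dist):
    """Cut a single-linkage hierarchy at max_dist (connected components)."""
    n = len(dist)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] <= max_dist:
                parent[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

def medoid(cluster, dist):
    """Member with the smallest summed distance to the rest of its cluster."""
    return min(cluster, key=lambda i: sum(dist[i][j] for j in cluster))

# Hypothetical fingerprints for four compounds.
fps = {
    "cpd_a": {1, 2, 3, 4},
    "cpd_b": {1, 2, 3, 5},
    "cpd_c": {1, 2, 3, 4, 5},
    "cpd_d": {10, 11, 12},
}
names, dist = distance_matrix(fps)
# A similarity cutoff of 0.7 corresponds to a distance cutoff of 0.3.
for cluster in single_linkage_clusters(dist, max_dist=0.3):
    members = [names[i] for i in cluster]
    print(members, "representative:", names[medoid(cluster, dist)])
```

With real data, the same distance matrix would feed the heatmap and dendrogram; the medoid is one reasonable reading of "the compound that defines the centroid of each cluster".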


Figure 13. Overview of Sig2Lead methodology. Sig2Lead utilizes iLINCS to query gene knockdowns of interest and identify concordant signatures. Once concordant signatures are obtained, Sig2Lead displays related compounds as a heatmap and MDS plot identifying centroids of each structurally related cluster. These representatives are reported to the end user as a reasonable starting place for characterization. All compounds identified via iLINCS are reported to the user as well in SDF format for molecular docking studies if requested. Sig2Lead serves as a fast high-throughput approach for identifying likely inhibitors based on their gene expression profiles as related to a gene knockdown. Using this approach, libraries can be filtered for only compounds with significant structural similarity to compounds driving a similar expression profile to a knockdown of the target gene.

In addition, Sig2Lead implements similarity searching of small molecules to increase the overall search space, by allowing users to expand upon existing clusters through similarity to small molecules from additional databases (without explicit connectivity). To accomplish this, compounds identified through the initial concordance to a targeted knockdown are measured for Tanimoto similarity to all compounds in the NCI library of freely accessible compounds using the FP2 fingerprints in OpenBabel72,81. This similarity score is precomputed and stored internally to increase the overall speed of the lookup. These similar compounds are added to the heatmaps of LINCS compounds, combining the transcriptomics approach of LINCS with cheminformatics to allow identification of related compounds that likely operate through the same target system. These additional putative inhibitors are also readily available for free through the NCI.

Benchmarking of Sig2Lead

The Directory of Useful Decoys Enhanced (DUD-E) is a benchmark developed for determining the success rate of virtual screening methods. It takes sets of known small molecule ligands for various target proteins, along with a library of small molecules known to lack activity against a given target, and assesses a method for its ability to discriminate between the known interacting partners and the known decoys. To evaluate the performance of the ceSAR method, all datasets from the original DUD3884 were filtered for chemical similarity to any compound found within the LINCS small molecules, removing any compound without a Tanimoto similarity of at least 0.65 to at least one LINCS compound. The reduced compound library was utilized for all targets present within the DUD38 target datasets that had gene knockdowns available within LINCS, resulting in 21 targets (datasets) for benchmarking. Of these 21 datasets, one target, AHCY, was omitted from analysis due to unclear agonist vs. antagonist status for a large fraction of true positives in the DUD38 set. Subsequent analyses were performed using the remaining 20 datasets. After running the initial search, clusters were generated using Tanimoto cutoffs of 0.6, 0.65, 0.7, 0.75 and 0.85 as thresholds for removing non-clustered compounds. Any added compounds that did not cluster with at least one LINCS compound were removed from the library. This allowed quantitation of false positives and false negatives, along with the overall enrichment of known binding partners of the DUD-E datasets.
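The library reduction and its evaluation can be illustrated with a short sketch. As a simplifying assumption, "clusters with at least one LINCS compound" is approximated here by "has a Tanimoto similarity at or above the cutoff to at least one LINCS compound"; the fingerprints, compound names and active set are hypothetical and are not taken from DUD-E.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def retain_similar(library, reference, cutoff):
    """Keep library compounds similar (>= cutoff) to at least one reference compound."""
    return {
        name
        for name, fp in library.items()
        if any(tanimoto(fp, ref_fp) >= cutoff for ref_fp in reference.values())
    }

def recovery_rates(retained, actives, library):
    """Fraction of actives and fraction of decoys that survive the filter."""
    decoys = set(library) - actives
    true_positive_rate = len(retained & actives) / len(actives)
    false_positive_rate = len(retained & decoys) / len(decoys)
    return true_positive_rate, false_positive_rate

# Hypothetical benchmark: two known ligands and four decoys.
lincs = {"lsm_1": {1, 2, 3, 4}}
library = {
    "ligand_1": {1, 2, 3, 4, 5},
    "ligand_2": {7, 8, 9},
    "decoy_1": {1, 2, 3, 4, 6},
    "decoy_2": {20, 21},
    "decoy_3": {30},
    "decoy_4": {40, 41},
}
actives = {"ligand_1", "ligand_2"}

retained = retain_similar(library, lincs, cutoff=0.75)
tpr, fpr = recovery_rates(retained, actives, library)
print(sorted(retained), tpr, fpr)
```

Raising the cutoff shrinks the retained set, which is the trade-off between enrichment and lost true positives that the benchmark measures across the five cutoffs.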


Connectivity Alone for Inhibitor Identification

The method employed to benchmark transcriptomic connectivity alone versus Sig2Lead first required selecting a suitable gene target and inhibitors of the target. For this study, EGFR was chosen as it is a well-known target gene and has several known inhibitors. The known inhibitors of EGFR were chosen via the GeneCards website (https://www.genecards.org/) and grouped as either FDA-approved or investigational. Sig2Lead was run for the approved drugs and investigational drugs, performing clustering with a Tanimoto similarity of 0.7 to recover added inhibitors and structurally similar compounds present within LINCS.

Enhancing Virtual Screening

Vascular endothelial growth factor receptor 2 (VGFR2) was selected for virtual screening as a target present within the DUD-E database and used in benchmarking of the AutoDockVina docking screening server used in this analysis, MTiOpenScreen (http://bioserv.rpbs.univ-paris-diderot.fr/services/MTiOpenScreen)85. The VGFR2 protein structure (PDB: 2P2I) was collected from the Protein Data Bank (www.rcsb.org) and cleaned of bound ligands by removing all atoms of the nicotinamide ligand. The grid center was defined as the average of the original nicotinamide atomic coordinates (x: 38.20, y: 35.47, z: 12.09), with a search space of 8,000 Å³ defined by a 20 x 20 x 20 Å grid box. A compound library of ligands and decoys for VGFR2 was retrieved from the DUD-E database (http://dude.docking.org)84 in the SDF format.
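The grid definition above amounts to averaging the ligand atom coordinates and checking the box volume. A minimal sketch, in which the atom coordinates are hypothetical placeholders rather than the actual 2P2I ligand atoms:

```python
def grid_center(atom_coords):
    """Average (x, y, z) of a list of atom coordinates, rounded to 2 decimals."""
    n = len(atom_coords)
    return tuple(round(sum(atom[axis] for atom in atom_coords) / n, 2)
                 for axis in range(3))

def box_volume(edge_lengths):
    """Volume of a rectangular search box from its three edge lengths (in Å)."""
    x, y, z = edge_lengths
    return x * y * z

# Hypothetical ligand atom coordinates in Å (not taken from the PDB entry).
ligand_atoms = [(37.0, 34.0, 11.0), (39.0, 36.0, 13.0), (38.6, 36.5, 12.3)]

print(grid_center(ligand_atoms))
print(box_volume((20, 20, 20)))  # the 20 x 20 x 20 Å box gives 8000 cubic Å
```

Centering on the co-crystallized ligand restricts the search to the known binding site, which is also why the docking here resembles a re-docking experiment.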

Sig2Lead was run on the enriched population after virtual screening using Tanimoto similarity cutoffs of 0.6, 0.65, 0.7, 0.75 and 0.85 to identify compound clusters. Any compounds from the starting library whose cluster did not include at least one LINCS compound were removed for library reduction. This same analysis was also performed on VGFR2 in the absence of virtual screening for comparison.


Results

Sig2Lead as a Tool for Identification of Known Targeted Inhibitors from Pathway-Specific Therapeutics

Utilizing a combination of transcriptomics and chemical similarity, novel pathway inhibitors can be identified. For characterization of this method, the DUD-E database84 was employed to determine the effectiveness of enrichment of known binding partners. Sig2Lead has demonstrated enrichment of ligands across a diverse subset of protein targets (Figure 14).

Within these datasets, enrichment was generally observed to be roughly 2-fold with a Tanimoto cutoff of 0.75 and generally increased with stricter thresholds, although stricter thresholds carry the risk of excluding too many true positives. With larger libraries, these values may increase, because in some of the cases tested the true positives consisted of only one or two distinct structural clusters, making identification all-or-nothing.

After plotting the true positives recovered against the false positives included, it becomes apparent that strict Tanimoto cutoffs lead to an increase in the percent of the true positives recovered after using Sig2Lead relative to random sampling from the library (Figure 15). Averaging over the 20 datasets, 27% of the true positives were recovered in a library that retains only 6.6% of the false positives. Random sampling, however, would on average recover only 6.6% of the true positives when 6.6% of the false positives are screened. These data show over a 4-fold enrichment over the expected true positives recovered at a 0.85 Tanimoto cutoff. Using the more relaxed 0.75 Tanimoto cutoff, these data showed roughly a 2-fold enrichment.
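The 4-fold figure follows directly from the two percentages quoted above: under random sampling the fraction of true positives recovered equals the fraction of false positives retained, so their ratio is the enrichment over random. A short worked check (the function name is illustrative only):

```python
def fold_enrichment_over_random(true_pos_recovered_pct, false_pos_retained_pct):
    """Ratio of true positives recovered to what random sampling would recover.

    Random sampling is expected to recover true positives at the same rate
    as false positives, so that rate serves as the baseline.
    """
    return true_pos_recovered_pct / false_pos_retained_pct

# Values reported in the text for the 0.85 Tanimoto cutoff.
enrichment = fold_enrichment_over_random(27.0, 6.6)
print(round(enrichment, 2))  # 4.09
```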


Figure 14. Sig2Lead enriches for inhibitors of target proteins. Fold enrichment of known positives based on DUD-E shows an increase in true positives as a percent of the remaining library. Median fold enrichment (blue) shows that strict thresholding (Tanimoto cutoff 0.85) within Sig2Lead drives an approximately three-fold increase in true positives, but can drive a significantly higher enrichment based on the 75th percentile (orange) and the maximum fold enrichment (gray). Using lower thresholds, lesser enrichment can be obtained, but with a lower risk of excluding positives. Minimum values (lower gray) and the 25th percentile (lower orange) do show some failures within the application; however, these values may reflect data not present within the LINCS database at the current date. Outliers from the box and whiskers plots are displayed as points above or below each plot. Box and whiskers plots display the median, 25th and 75th percentiles and maximum and minimum values, while the line plots connect these points to illustrate that this is a continuous distribution.


[Figure 15 plot. Axes: True Positive Rate (0 to 100) versus False Positive Rate (0 to 100). Series: Sig2Lead, Random.]

Figure 15. Receiver operating characteristic (ROC) analysis shows an enrichment of true positives in screening of libraries submitted to Sig2Lead. ROC curves are plotted by taking the average of the remaining ligands from the DUD-E datasets within each fraction of the library screened at a given Tanimoto cutoff. The ROC area under the curve is 0.65 for running Sig2Lead without additional filtering, compared to an area under the curve of 0.5 for random guessing.
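An area under the curve of this kind can be computed from a handful of operating points with the trapezoidal rule. The sketch below is an illustration of that calculation only: it uses the single averaged operating point quoted in the text for the 0.85 cutoff (27% true positives at 6.6% false positives), so it does not reproduce the reported value of 0.65, which is based on the points at all cutoffs.

```python
def roc_auc(points):
    """Trapezoidal area under an ROC curve.

    points: iterable of (false_positive_rate, true_positive_rate) pairs in
    the range 0..1. The end points (0, 0) and (1, 1) are added automatically.
    """
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

print(roc_auc([]))                          # diagonal only: 0.5
print(round(roc_auc([(0.066, 0.27)]), 3))   # one operating point: 0.602
```

Adding the operating points from the looser cutoffs would fill in the middle of the curve and change the area accordingly.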

Increased Sensitivity and Specificity of Transcriptomic Approaches in Drug Discovery through Chemical Similarity

The purpose of this characterization was to determine whether applying the structural similarity filter over an existing transcriptomics filter would identify drugs that would not be identified in a typical transcriptomics screen. To investigate this, EGFR was chosen as a target that has several known inhibitors, some of which are not present in the LINCS compound library. This library of FDA-approved (Table 4) or investigational (Table 5) drugs targeting EGFR was then searched for inclusion within the iLINCS server when compared to an EGFR knockdown, and additionally by running Sig2Lead with EGFR as the target.

The drugs that were in LINCS demonstrated the transcriptomics screening aspect of Sig2Lead, and the drugs not in LINCS tested the screening based on structural similarity. Nine EGFR inhibitors were in iLINCS, and another seven inhibitors that were not in iLINCS were identified via Sig2Lead. This set of compounds was ideal to demonstrate ceSAR’s two-pronged approach of genomic screening followed by structural similarity screening.

Table 4: Sig2Lead increases sensitivity of transcriptomic approaches for identification of FDA-approved drugs targeting EGFR. Five of the seven FDA-approved drugs (Lapatinib, Gefitinib, Erlotinib, Afatinib and Neratinib) passed the genomics screen as their signatures were available in LINCS (iLINCS). The remaining two drugs, Icotinib and Vandetanib, were not found in the LINCS library; however, both compounds still clustered with distinct LINCS compounds by structural similarity.

Compound Name   Present in iLINCS   Present in Sig2Lead   Cluster Number   Cluster Size   Cluster Centroid
Lapatinib       Yes                 Yes                   1                94             LSM-4706 (F1566-0341)
Neratinib       Yes                 Yes                   1                94             LSM-4706 (F1566-0341)
Gefitinib       Yes                 Yes                   187              5              LSM-1098 (Gefitinib)
Afatinib        Yes                 Yes                   236              4              LSM-42794
Erlotinib       Yes                 Yes                   260              4              LSM-1097 (Erlotinib)
Icotinib        No                  Yes                   260              4              LSM-1097 (Erlotinib)
Vandetanib      No                  Yes                   329              3              LSM-2921 (ZM 306416)


Table 5: Sig2Lead increases sensitivity of transcriptomic approaches for identification of investigational drugs. Canertinib, Pelitinib, Genistein, and Momelotinib were found in LINCS and clustered with structurally similar compounds. The remaining five compounds (AV-412, BMS-690514, BMS-599626, Dacomitinib, and Sapitinib) were not found in LINCS, but all clustered with at least one other LINCS compound with a Tanimoto cutoff of 0.70.

Compound Name   Present in iLINCS   Present in Sig2Lead   Cluster Number   Cluster Size   Cluster Centroid
BMS-599626      No                  Yes                   1                100            LSM-4706 (F1566-3041)
Canertinib      Yes                 Yes                   142              6              LSM-1120 (Canertinib)
Dacomitinib     No                  Yes                   142              6              LSM-1120 (Canertinib)
AV-412          No                  Yes                   190              5              LSM-42794
Pelitinib       Yes                 Yes                   190              5              LSM-42794
Momelotinib     Yes                 Yes                   313              3              LSM-1141 (Momelotinib)
Genistein       Yes                 Yes                   366              3              LSM-5549
Sapitinib       No                  Yes                   517              2              N/A
BMS-690514      No                  Yes                   719              2              N/A

Enhancing Docking Simulations for Identification of Small Molecule Inhibitors

Virtual screening is a typical approach for high-throughput small molecule library reduction for the enrichment of likely true positives. By applying an orthogonal approach, this library size can be reduced further, incorporating a greater percentage of true positives. Typically, this additional screening is performed at the bench top, but with the application of Sig2Lead, fewer compounds require wet lab testing to drive similar overall results.


Figure 16. Sig2Lead enhances sensitivity of transcriptomic analysis. Transcriptomic approaches are able to identify any compounds that have been tested with the assay, but are unable to identify those that are structurally related and likely drive a similar phenotype. By applying Sig2Lead, 7 FDA-approved or investigational compounds from DrugBank were identified that would not have been identified by applying transcriptomics alone.

Benchmarking Sig2Lead in the Context of Virtual Screening

Sig2Lead was tested with VGFR2 before and after running AutoDockVina through the MTiOpenScreen web server85; the target was selected for benchmarking based on the top benchmarks for MTiOpenScreen. In the analysis of Sig2Lead alone, VGFR2 generally performed slightly above average, but was not the top performing target. Having data for this target both in DUD-E and from the previous MTiOpenScreen benchmark made this an attractive test case.


Comparisons to docking alone were performed in two steps. First, a small library of DUD-E compounds, which consisted of 1797 LINCS-related compounds, was utilized for comparing Sig2Lead and MTiOpenScreen’s implementation of AutoDockVina side by side (Figure 17). This analysis showed Sig2Lead outperforming traditional virtual screening, but was only run on a single target system. After running both programs independently, the two were run in tandem, using MTiOpenScreen as an initial filter, followed by Sig2Lead from the top 100 compounds predicted by MTiOpenScreen’s free energy predictions. This two-step filter was compared to docking simulations alone, showing a 2-fold enrichment of true positives in the final library (Figure 18). It should be noted that the docking results are likely overly optimistic since the target protein structure used here was in bound form, i.e. it was crystallized in the presence of a small molecule inhibitor and thus represented a re-docking experiment.

This library, however, consisted of a small number of compounds, so the whole DUD-E VGFR2 library, consisting of 24,200 compounds, was utilized as an additional test for increasing the effectiveness of docking simulations alone. Typically, docking simulations would be performed to reduce the library size to a manageable number of compounds for in vitro characterization. Two cutoffs were tested for Sig2Lead enrichment: the top 500 compounds and the top 100 compounds as evaluated by predicted free energy of binding through MTiOpenScreen. From these compounds, Tanimoto thresholds of 0.6, 0.65, 0.7, 0.75 and 0.85 were set within Sig2Lead, and compounds that did not cluster with at least one LINCS compound were excluded from the final library. With the remaining compounds, true and false positive rates (Figure 19) were determined as the remainder of the starting true and false positives, along with the percentage of the remaining library that were true positives (Figure 20). These results show that at all library sizes, Sig2Lead can provide additional enrichment of true positives, at least in the context of VGFR2.
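The two-stage filter described above can be sketched as a ranking step followed by a membership check. This is a schematic illustration, not the MTiOpenScreen or Sig2Lead code: the docking scores and compound names are hypothetical, and the set of compounds that cluster with a LINCS compound is assumed to have been computed separately. The final check uses the counts reported in the Figure 20 caption (16 true positives among 21 remaining compounds).

```python
def top_n_by_score(scores, n):
    """Names of the n compounds with the lowest (best) predicted binding energy."""
    return sorted(scores, key=scores.get)[:n]

def two_stage_filter(scores, n, clusters_with_lincs):
    """Docking shortlist first, then keep only compounds clustering with LINCS."""
    return [c for c in top_n_by_score(scores, n) if c in clusters_with_lincs]

def percent_true_positive(library, actives):
    """Percentage of a compound list that are known actives."""
    return 100.0 * sum(1 for c in library if c in actives) / len(library)

# Hypothetical docking scores in kcal/mol (more negative is better).
scores = {"cpd_1": -10.2, "cpd_2": -9.8, "cpd_3": -9.1, "cpd_4": -7.5, "cpd_5": -6.0}
clusters_with_lincs = {"cpd_1", "cpd_3", "cpd_5"}

print(two_stage_filter(scores, 3, clusters_with_lincs))

# Counts from the Figure 20 caption: 21 compounds remain, 16 are true positives.
remaining = [f"hit_{i}" for i in range(21)]
known_actives = set(remaining[:16])
print(round(percent_true_positive(remaining, known_actives), 1))  # 76.2
```

Because the second stage only removes compounds, it can raise the percentage of true positives in the shortlist but never recover an active that docking ranked below the cutoff.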

[Figure 17 plot. Axes: %True Positive (0 to 14) versus %Library Screened (0 to 100). Series: MTiOpenScreen, Sig2Lead.]

Figure 17. Library reduction of Sig2Lead (red) and AutoDockVina through MTiOpenScreen (blue) shows an increase in true positives upon library reduction, starting from the DUD-E compounds with structural similarity to at least one compound within the LINCS small molecule library. A library of 1797 small molecules from DUD-E, including 31 known binding partners and 1766 known negatives, was tested for library enrichment in Sig2Lead and AutoDockVina for targeting VGFR2. At the greatest library reduction tested, 47 compounds remained within the library, including 6 and 4 true positives for Sig2Lead and AutoDockVina, respectively.


[Figure 18 plot. Axes: %True Positives (0 to 30) versus %Library Screened (ticks at 1, 10, 100). Series: MTiOpenScreen Top 100 + Sig2Lead, MTiOpenScreen.]

Figure 18. Sig2Lead can further enrich true positive populations from AutoDockVina by applying a transcriptomics filter to remaining compound libraries. From the 1797 small molecule library from DUD-E compounds similar to LINCS compounds, enrichment of true positives can be obtained through Sig2Lead after performing docking simulations (blue) rather than docking simulations alone (blue). The top 100 compounds obtained from MTiOpenScreen were submitted to Sig2Lead for structural similarity to compounds that produced a concordant signature to a knockdown of VGFR2.


[Figure 19 plot. Axes: True Positive Rate (0 to 16) versus False Positive Rate (0.0 to 2.0). Series: MTi Top 500 + Sig2Lead, MTi Top 100 + Sig2Lead, MTi Alone, Random.]

Figure 19. ROC analysis of Sig2Lead after MTiOpenScreen on the full DUD-E dataset for VGFR2 shows increased reduction of false positives when Sig2Lead is run after docking simulations. Above, a zoomed-in view of the ROC curve is displayed to show the differences between running docking simulations alone and a combination of docking and Sig2Lead. Significant enrichment occurs both in the presence and absence of Sig2Lead after docking simulations are performed; however, by adding this additional filter, increased enrichment can be obtained at low computational cost.


[Figure 20 plot. Axes: %True Positive (0 to 100) versus %Library Screened (ticks at 0.1, 1, 10, 100). Series: MTiOpenScreen Top 500 + Sig2Lead, MTiOpenScreen Top 100 + Sig2Lead, MTiOpenScreen Alone.]

Figure 20. Utilizing a combination of docking simulations and Sig2Lead drives an increased ratio of true positives within a small molecule library. Enrichment occurs with large libraries when combining traditional docking approaches with Sig2Lead, resulting in only a small number of compounds to screen in vitro with high success rates. When Sig2Lead analysis is performed starting from the top 100 compounds from docking simulations, the VGFR2 library can be reduced to just 21 compounds from the starting 24,200, 16 of which are true positives.

Sig2Lead Results for BCL2A1 Docking

This approach has been tested in the context of the BCL2A1 project described in Chapter II. To this end, compounds with high concordance to an A1 knockdown in 9 cell lines (HCC515, HEPG2, HT29, MCF7, PC3, VCaP, A375, A549, HA1E) were identified and clustered by chemical similarity to those that have been ordered and tested in vitro having a measurable IC50 by fluorescence polarization (Figure 21). The cell lines tested in LINCS included some with inducible A1 expression (HCC515, HEPG2, HT29, A375 and A549) and some that have little to no A1 expression (MCF7, PC3, VCaP and HA1E). The LINCS-derived compounds were compared to in vitro-tested compounds at each step outlined in Chapter II (Figure 22). The compounds that clustered at either a similarity of 0.8 or 0.6 showed that all hits identified previously had analogs in the LINCS library. This demonstrates good coverage of the docking-identified compounds for BCL2A1 and their derivatives in LINCS.

Figure 21. Compounds tested in vitro with an IC50<400 µM of BCL2A1 inhibition clustered with LINCS-derived compounds. P2 (red labels) and P4 (blue labels) compounds are shown hierarchically clustered with LINCS-derived compounds (black labels). This is the heatmap representation of the MDS in Figure 8.


[Figure 22 plot. Axes: %Compounds Found by Sig2Lead (0 to 120) versus Step (Total Ordered Wild Type Active Top 300 both pockets DSF Significant Tm shift Dose Response IC50<50uM Dose Response IC50<100uM High Dose inhibition 3SDs below). Series: Tanimoto similarity 0.6, 0.7, 0.8.]

Figure 22. Percent of compounds from each level of in vitro analysis identified to have similarity to LINCS-derived compounds at Tanimoto similarity of 0.6, 0.7 or 0.8 shows a trend towards inclusion of all active compounds through structural similarity to compounds within the LINCS database. Compounds identified through in vitro analysis show trends consistent with those observed in LINCS-derived compounds. As experimental tests become more stringent, compounds omitted based on lack of structural similarity to LINCS compounds with significant signature concordance are more prone to be omitted due to lack of activity, driving an increase in compounds found by Sig2Lead.


Sig2Lead can be utilized before or after docking simulations to reduce library size without significantly reducing the number of positive compounds (Figure 23). When an ROC curve was generated using the BCL2A1 data gathered in Chapter II, it was observed that at 81% of the library size (after lenient screening with Sig2Lead), 94% of compounds with observable IC50 values were retained and 100% of compounds with IC50 values ≤100 µM were retained within the library. More dramatic library reductions can be made where necessary, but will come with an increased false negative rate. In general, this approach can be used to accurately reduce library size, enriching for inhibitors of the target protein. However, since this application is used in identifying pathway inhibitors, there will still be false positives when looking at a specific target. Figure 23 shows one such reduction, driving an enrichment of both inhibitors with IC50s ≤100 µM and those that had a measurable IC50 below 400 µM. Thus, this approach to library reduction could have been utilized to remove over a third of the starting library for BCL2A1 screening and still recover 82% of the true inhibitors. Since 40 compounds can be ordered per month free of charge through the NCI library, this library reduction would have saved over a month, while still allowing detection of most true positives.
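Two of the figures quoted above can be checked with simple arithmetic. The sketch below assumes fold enrichment is defined as the fraction of active compounds retained divided by the fraction of the library retained, a definition the text does not spell out, and it uses the 51-compound reduction stated in the Figure 23 caption together with the ordering rate of 40 compounds per month.

```python
def fold_enrichment(actives_retained_fraction, library_retained_fraction):
    """Enrichment of actives in the reduced library relative to the starting library."""
    return actives_retained_fraction / library_retained_fraction

def months_saved(compounds_removed, compounds_per_month):
    """Ordering time avoided by not purchasing the removed compounds."""
    return compounds_removed / compounds_per_month

# Lenient screen from the text: 94% of measurable IC50s kept at 81% of the library size.
print(round(fold_enrichment(0.94, 0.81), 2))   # 1.16

# Figure 23 reduction: 51 fewer compounds at 40 compounds per month.
print(months_saved(51, 40))                    # 1.275
```

The second result is consistent with the statement that the reduction would have saved over a month of ordering time.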

Discussion

Sig2Lead provides a novel tool for combining transcriptional connectivity with chemical similarity searching for the design of small molecule inhibitors targeting a pathway of interest. Utilizing these approaches allows researchers to identify structurally distinct classes of compounds that drive similar transcriptomic profiles to those of shRNA knockdowns of the target. By identifying these diverse clusters, researchers can rapidly prioritize the chemical space relevant to their target of interest. As can be seen from Figure 14, this new method has yielded enrichment for true inhibitors in all but 2 targets included in the subset of DUD-E benchmarks considered here.

Figure 23. Reduction of compounds ordered for screening of BCL2A1 by running Sig2Lead and generating clusters with Tanimoto similarity of 0.65 results in enrichment of active compounds. By running Sig2Lead as an additional filtering step prior to ordering compounds identified by virtual screening in Chapter II, the number of ordered compounds would have been reduced by 51 compounds, of which only 6 had detectable IC50 values under 400 µM. This reduction leads to a 1.26-fold enrichment in overall detectable IC50s and a 1.35-fold enrichment of compounds with IC50s ≤100 µM.

First, Sig2Lead has demonstrated that it can be used alone to drive approximately 3-fold enrichment of known interacting partners when tested against an existing docking benchmark comprised of a diverse set of target proteins. Existing compound libraries can be added in their SMILES format to Sig2Lead to quickly filter out compounds not expected to provide similar expression signatures to a knockdown of their target, based on chemical similarity to known compounds. In the absence of a library of interest, compounds that are already present in LINCS can be used as the library. Many of these compounds are already known inhibitors of a protein target, and could be repositioned to address additional conditions. Moreover, some compounds present in the LINCS library or a user-provided library may be useful as generics in place of existing FDA-approved drugs, or could be used for SAR approaches to improve initial lead compounds that are within LINCS or the NCI library. This approach would be useful in new or poorly funded labs as a means to generate preliminary data.

Sig2Lead has demonstrated increased sensitivity in library searching over transcriptional connectivity alone through the addition of structural similarity. A typical workflow within a transcriptomic analysis would identify perturbagens that drive a similar profile to a target pathway. While connectivity alone will capture inhibitors if the expression profiles are present in the database, many compounds are structurally related to those demonstrated to drive a given profile but have not been tested themselves. Such compounds could be inferred without the direct need for transcriptomic analysis of the specific compound in cell lines. Using this approach, known inhibitors can be identified within the added chemical similarity threshold that would not have otherwise been found through traditional transcriptomic approaches.

Finally, Sig2Lead has yielded enrichment of small molecule libraries for true positives in combination with docking simulations. Typical docking workflows are designed to enrich an existing small molecule library for compounds expected to interact with the target protein. However, docking simulations can result in large false positive rates, especially if biases, e.g., to large hydrophobic moieties, are not corrected. This limitation can be overcome using an orthogonal approach to library reduction. Sig2Lead applies a rapid transcriptomic filter that only requires a knockdown signature to be present in LINCS data sets or a user-defined expression signature that has significant overlap with the L1000 genes. This filter is shown to have a multiplicative effect on the already enriched docking subsets (Figure 19), allowing users to find a subset of compounds predicted to inhibit something within the target pathway and to potentially interact with the target protein.

In summary, Sig2Lead utilizes ceSAR, a method for increasing the overall effectiveness of the drug design process, allowing existing datasets to do the heavy lifting to save hours of bench work in the lab. This application requires minimal structural information about the target, a limiting factor in typical virtual screening approaches, and can be run on a personal computer in a short amount of time. By applying this method in combination with existing in silico techniques, further enrichment can be obtained, saving time and money and providing researchers preliminary data without large expense.

Chapter IV: Conclusions and Future Directions

Identified Inhibitors of BCL2A1

In Chapter II, a protocol was described in which 90,086 potential compounds were reduced to just 13 that had significant activity in the inhibition of BCL2A1. This discovery process used in silico approaches to dramatically limit the chemical search space while enriching for likely inhibitors, driving biochemical studies with purified protein. Once tested in vitro, compounds were expanded upon to produce a small SAR study, thus identifying compounds with better inhibition of BCL2A1-Noxa binding. After performing in vitro binding assays, compounds with the best inhibition of the protein-peptide interface were tested in vitro using primary splenocyte cells. These cell assays revealed two compounds which outperformed the others, both predicted to interact with the P2 pocket. In early studies, one of these two compounds was shown to function specifically through the apoptotic pathway based on its lack of inhibition in Bax/Bak deficient cells at low concentrations. The 13 compounds that are shown to disrupt Noxa binding from FP studies will need to undergo some additional refinement through additional testing of SAR compounds. As part of this study, P2 compounds will be grown into the P4 pocket and P4 compounds will be grown into the P2 pocket for increased sensitivity and specificity. These studies aim to develop a drug for treatment of autoimmune diseases and cancer.

HT1080 Cell Line Cell Death Assay

The primary cell death assays identified two compounds capable of driving cell death; however, these compounds seemed to drive cell death through general cytotoxicity when doses were sufficiently high. Using primary cells is a time-consuming process, requiring optimization of experiments that already take around a week to complete. To provide a more rapid screen for determination of cell toxicity, a cell line assay can be performed. This assay will allow a higher throughput in screening of additional compounds for determination of cell toxicity that may occur in non-specific fashions.

To this end, HT1080 cells will be utilized as described by the Baldwin group13. These cells will be stably transfected to overexpress flag-tagged A1 from a pcDNA3 expression plasmid that includes a neomycin resistance gene for selection of cells that incorporated the plasmid, and will be tested for A1 expression using an anti-flag monoclonal antibody. This cell line has demonstrated expression of A1 in the presence of NF-κB, but not in its absence13.

Once stable transfection is achieved and expression of A1 has been confirmed in this cell line, compounds can be tested in a cell death assay similar to the primary cell death assay described in Chapter II. This assay will allow a higher-throughput analysis by allowing for more tests to be performed in a shorter amount of time. Compounds will be incubated with cells expressing A1 and cell death will be compared between the stably transfected HT1080 cell line and HT1080I cells that express IκBα, thus driving apoptosis by blocking NF-κB signaling which has been demonstrated to activate A113. This approach will compare cell death through a trypan blue inclusion assay to determine if cells can be driven to apoptosis through the inhibition of A1, driving cell death upon stimulation with TNF-α. Expected cell death through A1 inhibition can be determined through TNF-α induction of the HT1080 cells transfected with an empty control vector, thus not containing the pro-survival functionality of A1. These cells should die upon stimulation with TNF-α, and cells treated with compounds that drive cell death through A1 should show similar levels of death. Cells tested in the absence of compounds will establish the background killing exhibited through TNF-α treatment. Additionally, these cells can be treated with compounds in the absence of TNF-α induction to determine if the compounds exhibit general cytotoxicity. It should be noted, however, that this assay will identify small molecules that are able to drive cell death on their own, but this is not necessarily the expectation, as these compounds are serving as intermediate building blocks. These assays will be useful for identifying small molecules that demonstrate general cytotoxicity in a high-throughput manner.

Structural Resolution of BCL2A1 Inhibitors

Crystallization screens will be performed in combination with small molecules shown to disrupt Noxa binding. Currently, A1 and Bfl1 have been crystallized in the presence of BH3 peptides, but not in their apo-form. The apo-A1 structure will be determined and followed up with crystal soaks of top-performing small molecules. Once structures are solved with these small molecules, the proper binding poses will be determined to aid in design of linker regions for rational growing into full-length drugs.

Development of Identified Inhibitors Through SAR and Compound Growing

SAR studies need to be performed on P2 and P4 targeted inhibitors. NSC-97318, in particular, has some concerning features that would preferably be removed in favor of isosteric groups. Specifically, the di-azo bond bridging the benzene and naphthalene rings may cause hepatotoxicity or may be metabolically cleaved86. Since this bond connects the predicted P4 linking sulfonate group to the region targeted to P2, cleavage at this location would likely reduce the effectiveness of the final drug. To prevent this, reduction of the di-azo bond to an azine bond or replacement with an isosteric alkene bond will be tested. Additionally, the sulfonate group will be converted into a sulfone group as a bridge to the P4 pocket. The reduction to a sulfone should increase the permeability of the compound87, hopefully removing the need for the addition of agents such as polybrene for crossing the cell membrane.

Development of active inhibitors will require the P2 and P4 pocket inhibitors described in Chapter II to be tethered together to increase the affinity of binding, while maintaining specificity. Successes in these types of approaches often rely more on growing a particular inhibitor or fragment than simply attaching the two inhibitors together88. This is due to atoms having specific bond angles and bond lengths which impose restraints on available linkers54. For this reason, to build on the inhibitors previously identified, lead compounds for the P2 pocket will be gradually built into the P4 pocket, starting with NSC-97318 and NSC-15508 (Figure 24), since they have both shown activity in primary cells. Some synergy experiments have been performed already; however, no results have shown any activity better than additivity. This was not unexpected, as the compounds independently have low binding affinity and may be competing for part of their respective pockets, but as the inhibitor is grown into existing P4 inhibitors, it is expected that the affinity will be greatly increased by driving a local concentration increase after binding to one pocket.

To test this hypothesis, each lead compound will be built into the opposite pocket, building in the moieties of other known binders. Based on the results from Chapter II, five P2 inhibitors and eight P4 inhibitors were identified with IC50<50µM, which would result in 40 unique combinations along with any additional compounds found through LINCS related compounds. Each of these combinations will require varying lengths of linkers to test for optimal connections. These syntheses will be performed by a medicinal chemist and once bridged molecules are generated, fluorescence polarization will be performed, as described in Chapter II, to determine efficacy.

The top-performing bridged compounds will be tested not only against BCL2A1, but also against purified BCL2 and MCL1. These screens will allow in vitro detection of off-target effects that would likely be problematic. After performing these screens, bridged compounds that showed activity with purified BCL2A1, but not with BCL2 or MCL1, will be tested in vitro using primary splenocyte cells cultured for activated T-cells. Wild-type and Bax/Bak-/- primary cells will be tested in this assay to determine if driving cell death is specific to the intrinsic apoptotic pathway or if it is due to a more general cytotoxicity. Any specific compounds identified through these screens will be tested further to achieve FDA approval (likely in an industry setting).

Preclinical Trials

Once inhibitors are fully developed, preclinical trials must be conducted to explore the potential toxicity of the drug before testing in human patients. These trials are two-fold: in vitro testing in cell lines to establish dosing requirements when directly applied to cells of interest, and in vivo testing of the in vitro derived doses to identify the pharmacokinetics and toxicity of the drug in mice. Initial in vitro characterization has been performed, as described in Chapter II, suggesting an interaction with a number of current small molecules. Prior to preclinical trials, bridged compounds will be generated as described above. These bridged compounds will be tested in vitro against primary splenocytes enriched for activated T-cells. The dosing identified in this fashion will next need to be tested in mice to determine any potential morbidity associated with the treatment.

Figure 24. Chemical structures of NSC-97318 and NSC-15508 for targeting the P2 pocket of BCL2A1. NSC-97318 (A) and NSC-15508 (B) will be grown into the P4 pocket from the P2 pocket to develop two inhibitors that are highly specific and sensitive to the BH3 groove of BCL2A1. NSC-97318 will be grown from the sulfonate group, which is in common with many of the P4 inhibitors and located closest to the P4 pocket (C). NSC-15508 will be grown from the benzene ring currently lacking in functional groups, as this is the least likely to cause steric clashes based on top predicted poses (D).

Intraperitoneal injection of bridged molecules at four dosing levels (5, 50, 500, 2000 mg/kg) will be performed in five female and five male C57/B6 mice at each dose, according to the fixed-dose procedure for determination of acute toxicity89. Following this procedure will limit the number of animals sacrificed and allow observation of a wide range of toxicity measurements. When clear signs of toxicity are observed, the drug will be classified as very toxic, toxic, harmful or unclassified, depending on the dose level at which toxicity is first observed. Using these metrics, no more than 10% of a toxic dose in mice will be used as a safety correction factor when transitioning to human trials as described in the EPA guidelines90. In the described study, the dose (in mg/kg) identified in the mouse work will be divided by 12.3 for the human transition (a correction factor based on body surface area)91. After performing these preclinical trials, an Investigational New Drug (IND) application would be submitted prior to beginning clinical trials.
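
The dose scaling described above can be illustrated with a short calculation. The following Python sketch is only an illustration, not part of the study protocol: the function name is hypothetical, and it assumes that the body-surface-area factor (12.3) and the 10% safety fraction are applied in sequence, which the text does not state explicitly.

```python
# Fixed-dose levels (mg/kg) named in the text for the acute toxicity study.
FIXED_DOSES_MG_PER_KG = (5, 50, 500, 2000)


def human_starting_dose(mouse_toxic_dose_mg_per_kg,
                        bsa_factor=12.3,
                        safety_fraction=0.10):
    """Return (human_equivalent_dose, capped_starting_dose) in mg/kg.

    Assumption (not stated in the text): the mouse dose is first divided by
    the body-surface-area factor, and the 10% safety cap is then applied to
    the resulting human-equivalent dose.
    """
    if mouse_toxic_dose_mg_per_kg <= 0:
        raise ValueError("dose must be positive")
    human_equivalent = mouse_toxic_dose_mg_per_kg / bsa_factor
    return human_equivalent, human_equivalent * safety_fraction


if __name__ == "__main__":
    for dose in FIXED_DOSES_MG_PER_KG:
        hed, cap = human_starting_dose(dose)
        print(f"{dose:>5} mg/kg in mouse -> HED {hed:.2f} mg/kg, "
              f"capped starting dose {cap:.2f} mg/kg")
```

Under these assumptions, toxicity first observed at 500 mg/kg in mice would correspond to a human-equivalent dose of roughly 40.7 mg/kg and a capped starting dose of about 4.1 mg/kg.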

Future Developments of Sig2Lead

Sig2Lead aims to bridge transcriptomic data and structural similarity to rapidly identify putative lead compounds. In Chapter III this application was described and benchmarked, showing an increased specificity over a random library or docking enriched library, along with an increased sensitivity over transcriptomic approaches alone. This application allows users to generate data from unfiltered compound libraries in an afternoon from their personal computers, without the need for large expenses or structural information about their target of interest.

Currently, this approach can be run on any human gene with a knockdown present in LINCS, or on user defined signatures that have an overlap with the L1000 genes.

Sig2Lead is an application still largely in its infancy. While the core functionality is intact and informative, there are several additional features that are desirable in the final program. Within the next several months to years, these features will be added to the application and made available with the R package or as a web server. A few of these features will be described in detail below.

Web Server and User Interface Updates

The first feature that will be developed is a web server for Sig2Lead. Currently, the application is available only as an RShiny app, but this is not the most user-friendly implementation. The goal in developing a web interface is to build a simple input for ease of use in research. Additionally, the current distribution of Sig2Lead lacks functionality on MacOS, due to its dependency on ChemmineOB, which has an issue with installation on Macs (this can currently be accomplished, but is no easy feat). Developing Sig2Lead as a web application should circumvent this problem, allowing any user to access the app. The web interface will be professionally designed by members of the Meller group and will feature an interactive heatmap, as opposed to the current static heatmap, and will have access to the LINCS database, increasing the search speed relative to the RShiny app.

In addition to implementing the web interface, the existing user interface will be extended to allow ease of access of all data obtained by Sig2Lead. Currently, clusters in the MDS plot are displayed as the cluster number, so when the user downloads the CSV file containing the compound names and cluster numbers, the relative positioning to other clusters can be visualized. This view is functional, but for ease of access, Sig2Lead’s user interface will eventually allow hovering over a pie chart to see the cluster number and all members of the given cluster. By introducing this feature, it will be easier to determine which compounds are related to those that are added by the user.

The user interface will also include significantly more information about each individual compound. Some of the LINCS compounds are known inhibitors with a listed mechanism of action or treat a condition but have unknown targets. For this reason, compounds are currently hyperlinked within the user interface, allowing users to follow the link to the information page of each compound, but some of the most important data could be presented on the page within the application. This will let users quickly determine if their tested compounds are related to known inhibitors of their target, or if a particular compound cluster could be associated with a target based on relation to a known inhibitor of that target.

Multi-Gene Searching

In addition to the web interface, Sig2Lead will have additional features implemented as optional searches. Firstly, users will be able to input multiple genes known to exist together in a pathway or that they would like to see inhibited or activated together. Since Sig2Lead identifies likely inhibitors of pathways, instead of specific targets, addition of other pathway members should remove many of the false positives identified through similarities in gene expression profiles. The user will be able to specify up to three genes as a starting point, each of which will additionally be listed as either activating or inhibitory for the pathway of interest. This step will be crucial to identification of pathway inhibitors, as any inhibitory genes will be searched for discordant signatures, while activating genes will be queried for concordant signatures. After searching for each gene separately, the intersection of compounds for each will be identified for clustering. This implementation relies heavily on user-defined genes to derive the pathway and does not require Sig2Lead to have prior knowledge of the pathway of interest.
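
The multi-gene search described above can be sketched in a few lines. This Python example is a minimal illustration under stated assumptions, not the Sig2Lead implementation (which is distributed as an R package): the function names, the concordance threshold of 0.2, and the gene and compound identifiers are all hypothetical. Each gene is queried separately, activating genes keep compounds with concordant signatures, inhibitory genes keep compounds with discordant signatures, and the per-gene hits are intersected.

```python
def select_compounds(scores, role, threshold=0.2):
    """Pick compounds for one gene from {compound: concordance score}.

    Activating genes keep concordant compounds (score >= threshold);
    inhibitory genes keep discordant compounds (score <= -threshold).
    The threshold value is a placeholder assumption.
    """
    if role == "activating":
        return {c for c, s in scores.items() if s >= threshold}
    if role == "inhibitory":
        return {c for c, s in scores.items() if s <= -threshold}
    raise ValueError(f"unknown role: {role!r}")


def multi_gene_candidates(queries, threshold=0.2, max_genes=3):
    """Intersect per-gene hits; queries maps gene -> (role, scores)."""
    if not queries or len(queries) > max_genes:
        raise ValueError(f"provide between 1 and {max_genes} genes")
    per_gene = [select_compounds(scores, role, threshold)
                for role, scores in queries.values()]
    return set.intersection(*per_gene)


# Hypothetical example data: two pathway genes and three compounds.
example_queries = {
    "GENE_A": ("activating", {"cpd1": 0.45, "cpd2": 0.30, "cpd3": -0.10}),
    "GENE_B": ("inhibitory", {"cpd1": -0.35, "cpd2": 0.25, "cpd3": -0.50}),
}
candidates = multi_gene_candidates(example_queries)
print(sorted(candidates))
```

In this toy example only cpd1 passes both filters, so it is the single compound carried forward to clustering.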

Automated Docking of Sig2Lead Results

Sig2Lead is currently designed to provide the SDF files for all compounds identified from iLINCS to allow researchers to reduce their library size prior to virtual screening. As an additional option in the future, users will be able to submit these results directly to a virtual docking server. Ideally, a server hosted by the CCHMC computing cluster will allow users to directly utilize AutoDock 4.2.6 or AutoDock Vina92 with relatively low numbers of evaluations for a quick search, but the SwissDock93 or MTiOpenScreen85 online server could be utilized as an alternative. The user would be responsible for inputting the PDB code and optionally the grid box information. In the absence of a specified grid box, the application would default to a grid box that surrounds the entirety of the protein, thus increasing the search area and required time to run. A short script will be written to automate this process by finding a grid box that includes the whole protein with a small buffer around it into which compounds could fit. In the meantime, users will be directed to MTiOpenScreen with their SDF files (which could be downloaded from Sig2Lead); this server runs AutoDock Vina on the cloud for up to 5000 compounds with their target85.
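
A default grid box of the kind described above could be computed as follows. This Python sketch is one possible approach and not the planned Sig2Lead script: the function names and the 5 Å buffer are assumptions, and the parser reads only the fixed-width x, y, z columns of ATOM records in PDB-format text. The returned center and edge lengths correspond to the box center and size values that AutoDock Vina takes as input.

```python
def grid_box_from_pdb(pdb_text, buffer=5.0):
    """Return (center, size) of a box enclosing all ATOM records.

    center and size are (x, y, z) tuples in angstroms; `buffer` is added on
    every side of the protein (the 5 A default is an assumed value).
    """
    xs, ys, zs = [], [], []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM"):
            # PDB fixed-width columns 31-38, 39-46, 47-54 hold x, y, z.
            xs.append(float(line[30:38]))
            ys.append(float(line[38:46]))
            zs.append(float(line[46:54]))
    if not xs:
        raise ValueError("no ATOM records found")
    axes = (xs, ys, zs)
    center = tuple((min(v) + max(v)) / 2.0 for v in axes)
    size = tuple((max(v) - min(v)) + 2.0 * buffer for v in axes)
    return center, size


def _demo_atom(serial, x, y, z):
    """Build a minimal, column-aligned ATOM record for the demo below."""
    return "ATOM  {:5d}  CA  ALA A{:4d}    {:8.3f}{:8.3f}{:8.3f}".format(
        serial, serial, x, y, z)


# Two hypothetical atoms spanning a 10 x 20 x 30 A region.
demo_pdb = "\n".join([_demo_atom(1, 0.0, 0.0, 0.0),
                      _demo_atom(2, 10.0, 20.0, 30.0)])
demo_center, demo_size = grid_box_from_pdb(demo_pdb)
print(demo_center, demo_size)
```

For a real target the PDB file contents would be passed in place of the two-atom demo. A whole-protein box increases the search volume, so the buffer is best kept small.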

This feature would allow users to test compounds identified from the Sig2Lead screen for interactions with specific proteins from their target pathway. This step would be useful to users that want to inhibit a specific member of the pathway instead of the pathway as a whole by providing predictions about which protein most likely interacts with the compounds. After performing this docking analysis, users will have both transcriptomics and structure-informed data, strongly suggesting inhibition of a specific protein within a pathway of interest.

Increased Similarity Searching

Currently, Sig2Lead employs a similarity search, based on the Tanimoto coefficient, against compounds in the NCI library. These are readily available compounds that researchers can obtain free of charge from the NCI for in vitro and in vivo testing, and they provide a wide range of compounds covering a significant chemical search space. Chemical space, however, is vast, and the LINCS compounds together with the NCI library make up only about 350,000 of the potentially 10^63 organic small molecules that theoretically could exist94. To account for this shortcoming, Sig2Lead will utilize two additional sources for similarity searching.

First, Sig2Lead will allow similarity searching within the FDA approved drug list. This list only adds about 2,500 more compounds, but all have been FDA approved. Adding FDA approved drugs will allow users to compare a knockdown of their gene of interest to compounds similar to FDA approved drugs, many of which are also present within LINCS compounds. By searching FDA approved drugs, users may be able to reposition drugs to new targets, allowing them to bypass much of the bureaucracy involved in determining the safety of a novel drug.

Additionally, Sig2Lead will allow a slow similarity search between LINCS compounds and the ZINC database run by the Shoichet lab95. This database is the largest repository for small molecule structural information and includes over 200,000,000 compounds. By mapping similarity between LINCS compounds and the ZINC database, users will be able to sample a significant piece of the potential chemical space organic compounds can occupy. Additionally, some of the compound families that have not been thoroughly tested in the L1000 assay used in LINCS may be able to be expanded with similar compounds extracted from ZINC.

By expanding the similarity searching options in Sig2Lead, users will be able to expand results through a SAR approach. Once implemented, users will be able to identify compounds related to those of interest as a means to further enrich their initial libraries for virtual screening or to identify all related compounds to those known to drive a similar effect to a knockdown of their target pathway. This search will allow users to find all information about lead compounds in one place, easing the steps in rational drug design.
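
The Tanimoto-based similarity search that underlies these options can be sketched as follows. This Python example is illustrative only: it represents each fingerprint as a set of "on" bit positions, and the function names, compound names, bit sets and the 0.5 threshold are hypothetical rather than taken from Sig2Lead.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bits."""
    a, b = set(fp_a), set(fp_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)


def similar_compounds(query_fp, library, threshold=0.7):
    """Return (name, score) pairs at or above threshold, best first.

    `library` maps compound name -> fingerprint; the default threshold is an
    assumed value, not the cutoff used by Sig2Lead.
    """
    scored = ((name, tanimoto(query_fp, fp)) for name, fp in library.items())
    hits = [(name, score) for name, score in scored if score >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hypothetical fingerprints for a query and a three-compound library.
query = {1, 2, 3, 4}
library = {
    "cpd_A": {1, 2, 3, 4},
    "cpd_B": {1, 2, 3, 9},
    "cpd_C": {7, 8},
}
hits = similar_compounds(query, library, threshold=0.5)
print(hits)
```

The same scoring applies whether the library is the NCI collection, the FDA approved drug list or ZINC; only the library size, and therefore the run time, changes.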

Sig2Lead Towards Personalized Medicine

Currently, Sig2Lead is early in development as a tool for drug discovery and design; however, the application is fast enough that it could one day be used in a more clinical setting to identify the correct drug for diseases with multiple distinct treatments. To this end, Sig2Lead will accept primary data from an L1000 or similar assay and identify compounds that produce concordant and/or discordant signatures, depending on user preferences. This type of analysis relies on the same core architecture as running Sig2Lead for drug discovery purposes, with the notable exception that doctors/pharmacists using the application would need to use FDA approved drugs of interest as the added compounds. Additionally, Sig2Lead will need to output two additional pieces of information: the cell line tested and the concordance value associated with the signatures. With this information available, it becomes possible to match cell lines to the affected tissues, removing some of the non-specific compounds that may only act in certain cellular environments.

To achieve this goal, benchmarking will be performed on datasets of patients known to have a disease with multiple distinct treatments. Cancer for instance is moving towards precision medicine through tumor profiling with RNA-seq96. Once patient expression data become abundant, optimization of Sig2Lead with de-identified samples with known treatment courses could lead to its use in determination of the best treatment for a given patient sample.

Performing L1000 profiling is significantly cheaper (~$2/sample in reagents)45 than whole genome sequencing (~$1000/genome)97, reducing the overall cost to the patient, while still capturing over 80% of the true linkages45. The L1000 assay can be performed on thousands of samples/week, providing a high throughput screening platform for patients. Currently, the turnaround time is relatively slow due to the L1000 assay not being widely accessible; however, as more clinical labs adopt this technology, Sig2Lead will be positioned to rapidly expand in usefulness in the clinic.

Appendix I: Characterization of CovRS/ArlRS Two-Component Regulatory Systems as a mechanism for virulence regulation in Streptococcus pyogenes and Staphylococcus epidermidis

Introduction

Gram positive organisms are often involved in disease within mammalian hosts. These diseases are very diverse, from mild skin infections to lethal infections. The diseases caused by Streptococcus pyogenes span this range, from impetigo and strep throat to necrotizing fasciitis and toxic shock syndrome98. Other major contributions to disease by gram positive bacteria are those contracted as opportunistic infections by the organisms that are typically present as normal flora. Staphylococcus epidermidis causes a large proportion of these nosocomial, or hospital acquired, infections, especially in patients with joint replacements or other indwelling medical devices99.

S. pyogenes has a large assortment of virulence factors that are differentially activated to result in the various diseases caused by the organism98. S. epidermidis on the other hand, is normal flora on the vast majority of the population and activates its virulence factors as a survival and adhesion mechanism, often on plastic or artificial surfaces introduced during medical implantation99. While these organisms are very different when it comes to their virulence factors, it seems that they share a common mechanism of regulation of the virulence factors they do contain. The organisms both contain two component systems which are responsible for the regulation of the virulence or camouflage of the organism100–102.

The two component systems, CovRS and ArlRS, have numerous known targets of regulation, but are structurally poorly studied. CovS has been demonstrated to regulate roughly 15% of the bacterial genome, primarily genes involved in virulence102. ArlS has been shown to regulate biofilm formation of S. epidermidis, along with a number of other genes typically associated with bacterial survival100,101. Based on these regulations, it is attractive to target these systems for disease prevention through protein interface-targeted therapeutics.

Streptococcus pyogenes

S. pyogenes is a gram positive facultative anaerobe that is beta-hemolytic when grown on blood agar plates. It is a group A streptococcus (GAS) that forms clusters of bacteria when cultured and exists as normal flora in the respiratory system of 5-15% of the population. While the organism can exist as normal flora, it causes disease in many of the hosts colonized, ranging from skin irritation to severe invasive disease. This large difference in diseases caused by the same organism is due primarily to a switching mechanism in the activity of the CovRS two-component system103.

Streptococcus pyogenes pathogenic mechanisms

GAS produce infection in mucous membranes, tonsils, skin and deeper tissues through invasion and bacterial sepsis. S. pyogenes is an extracellular bacterial pathogen that can lead to many severe complications, many of which are fatal. There is a large number of virulence factors produced in GAS, many of which are involved in the development of the more severe invasive diseases. These factors include M protein, streptolysins O and S, Sda1 (DNase), pilin, streptokinase, hyaluronidase, C5 peptidase, extracellular pyrogenic exotoxins A, B, and C (SpeA, SpeB and SpeC), along with exotoxin F (mitogenic factor)103.

In addition to the several virulence factors above, S. pyogenes also has a number of superantigens104. Superantigens act through a different pathway than canonical antigens. Instead of the specificity between class II major histocompatibility complexes (MHC) and T-Cell receptors (TCRs), superantigens bind the MHC and the Vβ of the TCR non-specifically. This interaction causes massive T cell activation, releasing large amounts of interleukins (IL), tumor necrosis factor (TNF) and gamma-interferon (IFN). Superantigens do not require the processing by an antigen presenting cell (APC) that typical antigens require to cause activation of T Cells. Additionally, the subset of T cells activated by superantigens is significantly larger than that activated by conventional antigens since they bind the Vβ of the TCR instead of the variable chain. Each of these superantigens has distinct recognition of TCR Vβ regions105. It has been shown by the Kotb lab previously that these superantigens are important in disease pathogenesis, not only in severe invasive disease, but in acute nasopharyngeal infection as well104,106. In addition to superantigens being important in disease pathogenesis, it seems that the MHC haplotype of the host is also important in the severity of the disease caused by the bacteria104.

The normal colonization site for GAS is within the nasopharynx of the mammalian host, thus binding to this site is typically the first stage of the infection. This process begins with a weak interaction with host mucosal membranes followed by highly specific binding, allowing competition with normal flora. There are a number of various adhesin molecules within GAS, possibly for adhesion to different surfaces for colonization and infection including lipoteichoic acid (LTA), M protein, protein F/Sfb, FBP54 and another small fibronectin-binding protein which interact with fibronectin; along with other proteins that bind galactose, vitronectin and collagen. The bacteria also have glyceraldehyde-3-phosphate dehydrogenase and a hyaluronic acid capsule for adhesion98.

Once the bacteria have colonized the host, they are able to mitigate phagocytosis utilizing the M protein and the hyaluronic acid capsule. The M protein interacts with fibrinogen, masking the surface of the bacteria and blocking the activation of complement by reducing C3b binding of streptococci. These M proteins can be bound by antibodies specific to the particular M protein epitope, which allows recognition by neutrophils, triggering neutrophil extracellular traps (NETs)107. M protein has also been shown to induce proliferation of T cells108. The C5a peptidase expressed by the bacteria also contributes to immune evasion after colonization. This protein cleaves the complement factor C5a, thus reducing chemotaxis of polymorphonuclear leukocytes (PMNs). Another mechanism for avoidance of phagocytosis is the streptococcal inhibitor of complement-mediated lysis (SIC). This protein has been shown to be hypervariable and is responsible for inhibiting the membrane attack complex of the complement pathway.

Regulation of Virulence

Because there are so many virulence factors, and the bacterium is able to exist in so many various niches, there is a large degree of virulence regulation. S. pyogenes has global regulators of virulence, Mga and CovRS/CsrRS, along with an extracellular cysteine protease, SpeB, which degrades many of the virulence factors expressed only in the more invasive diseases. Mga has similarity to a recognition receptor of two-component systems and is primarily important for controlling expression of the M protein.

A second two-component system, csrR-csrS or covR-covS, was identified by the Wessels group to decrease transcription of the has operon (the operon encoding the hyaluronic acid capsule)109. A mutation in this two-component system was shown to increase transcription of a large number of virulence factors independently of growth phase, including streptokinase, streptolysin O and mitogenic factor. While there is a vast array of virulence factors present within GAS, many of them are routinely degraded by the organism, allowing a sort of camouflage while colonizing the host103. This degradation of virulence factors occurs via the pyrogenic exotoxin, SpeB, an extracellular cysteine protease110. SpeB is a secreted protease that is self-processed after secretion into a 28kDa active form111 and is regulated through CovRS, with decreased expression in CovRS mutants initially seen as lower antibody titer in invasive disease patients112. The antibodies produced against SpeB have been shown to be protective in mice, prolonging survival of mice infected with mixed populations of S. pyogenes strains113. SpeB has been shown to be necessary in skin infection, but is typically inversely related to severity of disease111.

Treatments

GAS infections are treatable with penicillin or amoxicillin in the less severe disease states, but are difficult to treat once they have become severe invasive diseases114. Generally, β-lactams are effective in treatment, while aminoglycosides and macrolides are less effective. Some fluoroquinolones are useful, but others show little activity115. In many necrotizing fasciitis cases, if early treatment with antibiotics proves ineffective, removal of the necrotic tissue will be performed if possible and amputation will follow if necessary114. Since this severe treatment is the current practice, it is important to find alternative approaches to treatment of severe invasive diseases caused by GAS.

Staphylococcus epidermidis

Staphylococcus epidermidis are coagulase-negative staphylococci (CoNS) and are members of the normal flora in all humans116. Normally, these bacteria are involved in maintaining homeostasis and prevention of colonization of disease-causing agents. S. epidermidis interferes with Staphylococcus aureus colonization through the serine-type protease Esp, which prevents S. aureus biofilm formation117. Due to the ubiquitous nature of S. epidermidis, the bacterium is often found as a contaminant of indwelling medical devices99,118.

Diseases

While S. epidermidis makes up a large portion of the normal flora, it is often associated with opportunistic or nosocomial infections. These infections are typically only problematic in patients with predisposing factors. These factors include premature birth, any immunosuppression and, most significantly, implanted medical devices118. Most catheter related infections are the result of S. epidermidis and cause disease in 6.8 per 1000 catheter implanted patients. It was found that CoNS cause 43% of nosocomial blood stream infections, with 80% of those caused by S. epidermidis119,120. In addition to catheter infections, S. epidermidis is the third most common disease-causing agent in infective endocarditis and the most common in prosthetic valve infective endocarditis121. A third major site of S. epidermidis infection is prosthetic joints. Between 30% and 50% of infections of hip and knee replacements have been attributed to CoNS and about 80% of those are S. epidermidis122–124. In 1982, Peters et al. described the first observation of what we now know to be biofilms of S. epidermidis colonizing central venous catheters125. Biofilm formation seems to be the primary mode of foreign-material infection, with the bacteria introduced during the surgery or implantation of the device. In staphylococcal biofilm infections of indwelling medical devices, treatment usually involves replacement of the device in question126.

Biofilm Formation in S. epidermidis

The primary virulence factor in S. epidermidis is biofilm formation. This process involves three main steps: primary attachment, accumulation and detachment127. The primary attachment phase involves colonization of some non-self surface. The bacteria then replicate and accumulate forming a multi-layered architecture. In this phase, most bacteria are not directly attached to the surface colonized in the primary attachment phase128. Finally, the biofilm disseminates and spreads.

As mentioned above, the first step in biofilm formation is primary attachment. In the case of S. epidermidis, attachment to polystyrene is possible and thought to be mediated by the autolysin AtlE129, through turnover of cell wall and binding of unmodified polystyrene130. The exact role of AtlE in primary attachment has not been fully demonstrated, and it may have a more secondary effect on attachment131. These artificial surfaces are the primary concern of biofilm formation when it comes to disease, but it is known that S. epidermidis also expresses surface proteins that interact with the extracellular matrix (ECM) of the host, allowing for primary attachment to host cells as well as to medical devices that have been coated in host ECM upon insertion132. In addition to AtlE, S. epidermidis expresses the proteins of the serine-aspartate repeat family, which are known to interact with ECM components127,133. It has also been shown that S. epidermidis biofilm structure includes extracellular DNA (eDNA), which is released through AtlE-mediated autolysis during biofilm formation134. This eDNA was shown to be important in the transition from the primary attachment phase to the accumulation phase on glass, probably for intercellular adhesion135,136.

Biofilm accumulation progresses utilizing intercellular adhesion mechanisms in which new cells aggregate to those that have attached to the primary surface resulting in a multilayered biofilm127,137. The most extensively studied accumulation factor is the polysaccharide intercellular adhesin (PIA)138. What was initially described as a single polysaccharide actually exists in two forms within S. epidermidis, the major polysaccharide I and the minor polysaccharide II138.

PIA forms aggregates in solution, but to be fully functional requires the expression of the full ica operon, icaADBC138,139. IcaA is a transmembrane glycosyltransferase 2 family member that synthesizes N-acetylglucosamine (GlcNAc) oligomers, which are important in intercellular adhesion139. The second gene in the operon encodes IcaD, a protein needed for activity of IcaA. This protein may function as a chaperone necessary for folding and membrane insertion of IcaA139. IcaB is a polysaccharide deacetylase that is necessary to produce the small portion of non-acetylated GlcNAc, but PIA can be produced in its absence at a low level140. IcaC has several transmembrane domains and may be involved in the externalization of the growing PIA molecule139. There is additionally an icaR gene, which encodes a member of the tetR family of repressors and has been demonstrated to repress expression of the ica operon in the presence of ethanol or high salt141. This operon is also regulated by the SarA protein, which activates ica by binding the promoter region142.

The production of PIA is a major player in biofilm formation in S. epidermidis, but there are a number of proteins that are involved in regulation of the accumulation and primary attachment phases of biofilm formation. The accumulation associated protein (Aap) and the extracellular matrix binding protein (Embp) play major roles in both attachment and accumulation127. Embp is primarily involved in primary attachment to fibronectin coated surfaces, but seems to show some importance during intercellular adhesion in the accumulation phase143.

Aap is a protein studied extensively in the Herr lab which has been shown to be important in biofilm formation144. Aap is linked to the cell wall covalently and contains multiple domains including an A-repeat and α/β domain which are proteolytically cleaved in the processed form, followed by the B-repeat region, a Proline/Glycine rich domain and the LPXTG anchor motif144.

The α/β domain is highly conserved between Aap and its S. aureus homolog, SasG, and is predicted to have lectin-like activity145. The full length protein is not able to induce intercellular adhesion without being processed through some proteolytic cleavage event146. This protein is involved at both the primary attachment and accumulation phases of biofilm formation.


The A-domain of Aap is thought to play a role in the primary attachment phase147. This domain was shown to be important in adhesion to polystyrene through knockout mutations and antibodies against Aap domain A, both of which greatly diminished adhesion147. This finding was later confirmed through knock-in experiments that improved plastic binding over wild-type or Aap B-domain strains145. The B-domain of Aap consists of 5 to 17 repeats of 128 amino acids, each containing a 78-amino-acid G5 domain and a 50-amino-acid linker domain148–150. These B-repeats have been shown to oligomerize in the presence of zinc, forming an antiparallel twisted cable that requires at least 5 B-repeats, and are thought to form a sort of Velcro between bacteria within a biofilm144. Additionally, the B-domain is variable even within clonally isolated clinical strains, suggesting it may be involved in immune evasion by modulation of cell surface proteins during disease progression148.

Two-component Systems

Two-component systems (TCS) involve two proteins, a membrane-anchored sensor histidine kinase and its cognate recognition receptor. The sensor recognizes a periplasmic or extracellular ligand, which induces phosphorylation of the sensor; the sensor then transfers the phosphate group to a conserved aspartic acid residue on the recognition receptor151. The recognition receptor then acts as a transcription factor, either activating or repressing target genes151. The sensor histidine kinases typically make two transmembrane passes, with a short intracellular N-terminal region followed by the first transmembrane helix and a globular sensor domain. After the sensor domain is the second transmembrane pass, often followed by a HAMP linker region, a histidine kinase domain and an HATPase domain that allows the breakdown of the ATP molecule151. Dimerization is typical and important in the signaling mechanism of most histidine kinases. Within the cytoplasmic domain of the protein is a conserved histidine residue that is autophosphorylated in trans when dimerized151. Additionally, in EnvZ, one of the better characterized histidine kinases, it is thought that the extracellular ligand causes an asymmetric displacement of one of the subunits within a dimer, altering the overall structure of the protein for signal transduction151. The cytoplasmic domains of this protein family are more widely conserved than the sensor domains, reflecting the diversity of ligands for this signaling pathway151. While these proteins are called sensor histidine kinases, most are actually bifunctional molecules and possess both kinase and phosphatase activity151.

CovRS

CovRS (control over virulence receptor and sensor), or CsrRS, is a TCS that involves two proteins, CovR and CovS, which are the recognition receptor and sensor histidine kinase, respectively152. This protein system is involved in regulation of 10-15% of the bacterium’s genome and allows for differential regulation resulting in the wide range of diseases caused by S. pyogenes102. Most of these genes are virulence or virulence-associated genes, which are regulated either directly by CovR binding of their promoter or through SpeB activity downstream of the CovRS system102. Mutations in this TCS have been demonstrated to result in a switch from mild diseases, such as strep throat and impetigo, to severe invasive diseases, such as necrotizing fasciitis and streptococcal toxic shock syndrome153.

CovS is the sensor histidine kinase in the protein pair. The protein is predicted to make two transmembrane passes with a putative extracellular sensor domain154 (Figure 25A). The cytoplasmic domain has conserved HAMP, HisKA and HATPase domains which allow for phosphorylation and phosphotransfer to the CovR recognition receptor151,154. CovS is a bifunctional sensor histidine kinase, allowing it to act as a kinase or phosphatase for CovR155. CovS will serve as a phosphatase during environmental stress, thus derepressing cov-regulated genes155. In addition to the natural derepression, some mutants are selected during animal passaging, which results in bacteria incapable of CovR phosphorylation103.

Figure 25. CD-HIT shows conserved domains in the intracellular domains of CovS (A) and ArlS (B), but a lack of information about the sensor domains. Predicted dimer interfaces, along with HAMP linker regions, HisKA histidine kinase domains and HATPase_c ATP binding sites are conserved among histidine kinases within their intracellular domains, but neither CovS nor ArlS have conserved domains present in their extracellular domains.

CovR normally binds to the promoter region of numerous genes at a conserved consensus motif, ATTARA, resulting in their repression155. When phosphorylated at the conserved D53 of the receiver domain, CovR has a higher affinity for these promoter regions due to dimerization of CovR152,156. In vivo, phosphorylated CovR is the active form157. Mutations in either CovR or CovS result in an increase in transcription of Cov-regulated genes. This is typically associated with an advantage to bacteria during invasive infection, but seems to inhibit their ability to colonize the nasopharynx, as they are no longer competitive within the saliva or skin158. It has also been observed that transmission is decreased within covRS deletion mutants158.
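To make the consensus motif concrete, the following minimal Python sketch scans a DNA sequence for ATTARA, reading R as the IUPAC code for a purine (A or G). The function name and the example sequence are hypothetical illustrations rather than real promoter data, and the sketch searches only the strand it is given.

```python
import re

# IUPAC expansion assumed here: R = purine (A or G).
# The lookahead allows overlapping occurrences to be reported.
COVR_MOTIF = re.compile(r"(?=(ATTA[AG]A))")


def find_covr_sites(sequence):
    """Return (0-based position, matched text) for each ATTARA occurrence."""
    seq = sequence.upper()
    return [(m.start(), m.group(1)) for m in COVR_MOTIF.finditer(seq)]


# Hypothetical sequence, used purely for illustration.
example = "ggcATTAGAtttcATTAAAcc"
sites = find_covr_sites(example)
for position, match in sites:
    print(position, match)
```

A fuller search would also scan the reverse complement, since a binding site can lie on either strand.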


In vivo selection of CovS mutants

When non-invasive strains of S. pyogenes are passaged through an animal, they become hypervirulent103. This phenomenon is the result of mutations within the covS gene, typically resulting in a truncated protein103. In the case of these truncation mutants, CovR is no longer able to be phosphorylated, resulting in a lack of repression of the virulence genes103. This occurs through a spontaneous mutation within the CovS protein that is selected for in vivo through the ability to escape from neutrophil extracellular traps (NETs)159.

The process for selection is understood as follows: wild-type S. pyogenes colonizes the mammalian host mucosal membranes or skin surface and, through abrasion or another break in the surface, is inoculated into the bloodstream160. The bacteria remain hidden while CovR- and SpeB-mediated repression of virulence factors results in minimal antigenicity, thus preventing identification by the immune system102,161. Leaky expression of M protein is recognized by neutrophils circulating in the blood, triggering the release of the NET, which is a combination of the neutrophil’s DNA and antimicrobial peptides159. Once within the NET, the bacteria will die unless they are able to escape from the DNA bound to their cell wall159. The DNase Sda1 normally produced by S. pyogenes is able to degrade the DNA in the NET, allowing the bacteria to escape159. Sda1, however, is repressed by the CovRS TCS102. Due to this repression, only the bacteria unable to repress Sda1 are able to escape from the NET and survive. These bacteria are also hypervirulent to the host because of the expression of the numerous superantigens and other virulence factors that are not repressed in these animal-passaged strains103.

ArlRS

Autolysis related locus, ArlRS, is another TCS within S. epidermidis and its more virulent cousin Staphylococcus aureus141. ArlR, the recognition receptor, and ArlS, the sensor histidine kinase that controls it, are transcribed as an operon101. The operon shows growth phase-dependent expression and is more highly expressed during the logarithmic growth phase101. SarA and AgrA are both involved in the regulation of ArlRS, increasing the expression of the arl operon when present101. This system is required for the regulation of the ica operon necessary for PIA-dependent biofilm formation, but is also an important protein complex for regulation of autolysis, naturally repressing autolysis in the presence of some stress conditions100.

S. epidermidis with ArlRS mutations only fail in biofilm assembly in vitro and do not struggle in the primary attachment phase of biofilm formation100. Early studies show mutants in ArlRS resulting in a drastic decrease in extracellular proteolytic activity which is suggested to repress the activity of protein-based adhesins for biofilm attachment and accumulation162. ArlRS has also been shown to play an important role in S. aureus virulence factor regulation101.

Factors regulated by ArlRS

While ArlRS has been demonstrated to be integral in virulence and biofilm regulation in both S. aureus and S. epidermidis, most of the work on identifying regulated genes has been done in S. aureus. In S. aureus, 114 or more of the approximately 2,600 genes within the genome have been demonstrated to be under the control of ArlRS, either directly or indirectly163,164. Both S. epidermidis proteins are highly conserved as compared to the S. aureus proteins, with ArlS being 70% identical and ArlR 84% identical165,166. ArlRS has been shown to have an impact on the virulence of S. aureus, modifying the expression of the agr and sarA loci101. ArlRS is quite important in virulence regulation, but it activates some virulence factors and inhibits others, resulting in a variety of diseases163. Many of the regulated genes are indirectly regulated through the activation of the agr operon or the repressor of toxins, rot163. In the absence of ArlS, norA expression is approximately six-fold higher than in the wild-type167. ArlR binds to the promoter of norA, and the putative consensus sequence for ArlR binding is TTAATT167.

PAS/PDC Folds

Both CovS and ArlS have been demonstrated to contain a PAS/PDC fold (Per-Arnt-Sim) (Figure 26). PAS domains are specialized domains of signaling or regulatory proteins that respond to a variety of extracellular signals168. Some of these signals are nutritional availability, voltage, gases, light, metabolites or redox potential169. These domains have been seen to be present in virulence regulating TCS previously, including systems necessary for biofilm dispersion. A typical PAS domain is 100 amino acids or longer with five β-strands forming a core β-sheet with a variety of interspersed α-helices to allow for ligand-specificity168. Extracytoplasmic PAS domains tend to take a different topology than cytoplasmic PAS domains and are thus labeled PDC domains for the founding family members that typify this topology, PhoQ, DcuS and CitA170.

While widespread amongst sensory proteins, the primary sequence of these domains is highly divergent, on the order of 20% identity or less within the family. Because of this divergence in amino acid sequence, identifying homologs within this family requires algorithms that do not depend on a high degree of sequence identity, such as the position specific iterative basic local alignment search tool (PSI-BLAST), or secondary structure based alignments such as PHYRE or FFAS168.
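For readers less familiar with how a figure such as "20% identity" is obtained, the sketch below computes percent identity from a pairwise alignment. It is a minimal illustration under stated assumptions: the two aligned strings are invented toy sequences, and identity is normalized by the number of columns in which both sequences have a residue, which is only one of several conventions in use.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over alignment columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == gap or b == gap:
            continue  # skip gapped columns
        compared += 1
        if a == b:
            identical += 1
    if compared == 0:
        return 0.0
    return 100.0 * identical / compared


# Toy alignment, invented for illustration only.
toy_a = "MKT-LLVAGS"
toy_b = "MRSQLI-AGT"
identity = percent_identity(toy_a, toy_b)
print(f"{identity:.1f}% identity")
```

At identities near or below 20%, a simple pairwise score of this kind is difficult to distinguish from chance, which is why profile-based or structure-aware methods are preferred for this family.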


Figure 26. Homology models for CovS (A) and ArlS (B) extracellular domains reveal a conserved fold for sensor domains. Predicted structures of these two proteins suggest common domains and overall fold, which could hint at a conserved mechanism of signaling. Additionally, both models show two α-helices at the N- and C- termini of the constructs oriented in the same direction, consistent with two transmembrane passes which is typical in sensor histidine kinase architecture.

In silico characterization

CovS and ArlS have been characterized by their expression profiles and downstream effects; however, due to being membrane proteins, they have poor structural characterization. Initially, CD-HIT was utilized154,171,172 to determine known domains of the sensor histidine kinase proteins (Figure 25). This analysis showed an important gap in these studies: a lack of information on the extracellular sensor domain in both proteins.

To address this, homology models have been generated for both the CovS and ArlS extracellular domains containing PAS folds using threading via PHYRE224. Unfortunately, this family of proteins has not been studied extensively via structural biology, resulting in low similarity to available templates. Therefore, additional models have been generated through other homology modeling servers, such as SwissModel, in an attempt to find a consensus prediction. These additional predictions reveal a common trend to PAS folds of these sensor domains, increasing the confidence of the predictions (Figure 26).

Figure 27. OMA Browser shows near evolutionary distance between ArlS (Red) and CovS (Blue). Other sensor histidine kinases are present within this tree, some of which have known structures.


Additionally, the evolutionary distance between CovS and ArlS was examined to determine whether solved structures exist for other proteins within close evolutionary distance (Figure 27). This analysis revealed a close similarity between CovS and ArlS and identified a number of additional sensor histidine kinases, such as the periplasmic sensor domain of CpxA from Vibrio parahaemolyticus. These additional structures could be used to improve upon the existing homology models.

From the generated homology models, predicted ligand binding sites were determined by mapping binding sites from homologs and largest cavity mapping (Figure 28)173. At the time of this analysis, the binding partner was unknown, but CovS has since been determined to interact with the human antimicrobial LL-37 peptide174. Based on this knowledge, some predictions can be made about the mode of binding. Since the deepest cavity is very well conserved amongst the various models, it seems reasonable to predict that this is the likely ligand binding site.


Figure 28. Homology-based (A-C) and cavitation based (D) identification of potential ligand binding sites (red) in the CovS extracellular domain. Predicted (A) Ca2+ binding site, (B) Heme binding and (C) Mg2+ binding based on homologous proteins. (D) Largest cavity based on surface cavitation via CASTp.


In vitro characterization

Homology models were then used as a guide for construct design of CovS and ArlS extracellular domains. Initial attempts at characterization were unsuccessful due to massive aggregation of CovS or ArlS upon expression. To address this, refolding was utilized during purification of these proteins. This refolding allowed for a small quantity of CovS and ArlS to be purified, enough to perform initial circular dichroism (CD) (Figure 29) and analytical ultracentrifugation (AUC) experiments (Figure 30) that verified proper folding and molecular weight.

[Figure 29 plot: CD signal (machine units) versus wavelength (200 to 260 nm), with traces at 20 °C and 80 °C.]

Figure 29. Circular dichroism of refolded CovS suggests properly folded protein that is consistent with a mixture of α-helix and β-sheet. CovS appears to maintain an α/β fold at 20℃ (blue) and loses its fold with increasing temperature to 80℃ (red).


Figure 30. Sedimentation velocity AUC of refolded CovS shows a single species with predicted size of about 14kDa. Fitted raw data (A) show high quality fits for a single species sedimentation coefficient of ~1.6S, which is consistent with a protein of approximately 14kDa.
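As a rough plausibility check on the mass estimate in Figure 30, the observed sedimentation coefficient can be compared with the maximum value expected for a compact sphere of the same mass. The sketch below uses the approximation S_max ≈ 0.00361 × M^(2/3) (M in Da, S in Svedbergs), which assumes a typical protein partial specific volume in water at 20 °C. This relation is an assumption introduced here for illustration and is not the fitting procedure used to produce the figure.

```python
def s_max_svedberg(mass_da):
    """Approximate maximum sedimentation coefficient (S) for a compact sphere.

    Assumes a typical protein partial specific volume in water at 20 C.
    """
    return 0.00361 * mass_da ** (2.0 / 3.0)


def shape_ratio(mass_da, s_observed):
    """Ratio S_max / S_observed; values near 1 indicate a compact particle."""
    return s_max_svedberg(mass_da) / s_observed


# Values taken from the Figure 30 caption: ~14 kDa and ~1.6 S.
covs_s_max = s_max_svedberg(14_000)
covs_ratio = shape_ratio(14_000, 1.6)
print(f"S_max = {covs_s_max:.2f} S, S_max/S = {covs_ratio:.2f}")
```

Under these assumptions the ratio comes out near 1.3. Ratios of about 1.2 to 1.3 are generally read as a roughly globular, hydrated protein, so the value is in line with a monomeric species of about 14 kDa.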


Due to low expression yields after a long purification protocol, a more efficient purification procedure was attempted. To this end, a pCOLD vector, kindly provided by Dr. Michael Wessels, was utilized in a modified expression system. This construct relies on the cold chaperones to assist in folding of the protein as it is being produced. Using CovS-MBP purified from this expression system, interaction with LL37 was tested by AUC to determine if binding of LL37 induced dimerization (Figure 31). Under these conditions, dimerization was not observed, but this could be due to the MBP tag present on the purified protein. To test this further, cleaved protein will need to be obtained through this expression system.

[Figure 31 peak label: ~49.2 kDa]

Figure 31. Sedimentation velocity reveals a peak consistent with monomeric CovS-MBP in the presence or absence of LL37 antimicrobial peptide. Running sedimentation velocity AUC on CovS-MBP with LL37 results in no shift in sedimentation coefficient, suggesting that CovS-MBP does not dimerize in the presence of a 1:1 ratio of LL37; however, the mass of LL37 is too small to effectively determine if interaction is occurring.
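The caveat in the Figure 31 caption can be illustrated with simple arithmetic. The sketch below compares the relative shift in sedimentation coefficient expected for a 1:1 CovS-MBP:LL37 complex with that expected for a CovS-MBP dimer, assuming the particle shape is unchanged so that s scales as M^(2/3). The LL37 mass of about 4.5 kDa is an approximate value assumed here and is not stated in the text; the 49.2 kDa value is the peak label from the figure.

```python
COVS_MBP_KDA = 49.2  # peak label in Figure 31
LL37_KDA = 4.5       # approximate peptide mass; an assumption, not from the text


def relative_s(mass_complex, mass_reference):
    """Expected s ratio if shape is unchanged, using s proportional to M^(2/3)."""
    return (mass_complex / mass_reference) ** (2.0 / 3.0)


peptide_bound = relative_s(COVS_MBP_KDA + LL37_KDA, COVS_MBP_KDA)
dimer = relative_s(2 * COVS_MBP_KDA, COVS_MBP_KDA)
print(f"1:1 peptide complex: s increases by {100 * (peptide_bound - 1):.0f}%")
print(f"dimer:               s increases by {100 * (dimer - 1):.0f}%")
```

Under these assumptions, peptide binding alone would shift the peak by only a few percent, whereas dimerization would shift it by more than half, which is why the experiment can rule out dimerization more readily than it can detect binding.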


Conclusions

Gram positive organisms cause a wide range of diseases, resultant from a vast array of virulence factors. These diseases can be mild or severe but require an abundance of regulatory mechanisms to tightly control. S. pyogenes is seen as a pathogen in most cases with a switching mechanism between mild skin infection and severe invasive disease which seems to be dependent on a single mutation which is selected for in vivo. S. epidermidis on the other hand, is an organism that exists as normal flora on the human population and is rarely seen as a dedicated pathogen. Instead, these bacteria are associated with opportunistic infections typically related to the implantation of a medical device.

TCS are invaluable to the regulation of virulence within Gram positive organisms. ArlRS and CovRS are global regulators that regulate 5% or more of their respective genomes. These systems appear to be similar structurally, falling within the same PAS fold, and regulate some of the same factors (i.e. capsule production and extracellular protease activity). ArlRS and CovRS, however, are generally poorly studied when it comes to the actual mechanism of recognition of the extracellular ligand required to induce this broad regulation of virulence genes. Due to the importance in virulence regulation of these systems, they could make great targets for small molecule inhibition. To achieve this, some characterization has occurred, but more will be necessary to develop a more accurate model and/or solved structure for virtual screening of small molecule libraries.


Appendix II: Hiding in Plain Sight: Immune Evasion by the Staphylococcal Protein SdrE

Andrew B. Herr, Alexander W. Thorman

Biochemical Journal. May 10, 2017. 474 (11) 1803-1806. DOI: 10.1042/BCJ20170132


Appendix III: Running AutoDock Tools and AutoDock4.2.6


Appendix III was modified from the tutorial175 at:

http://autodock.scripps.edu/faqs-help/tutorial/using-autodock-4-with-autodocktools/2012_ADTtut.pdf

If you wish to run AutoDock as a high-throughput system on the CCHMC cluster, see the section “Running AutoDock on the cluster (high throughput)” at the end of this appendix.

Connecting to the Cluster:

1. Connect to the CCHMC cluster using Citrix

a. connect.research.cchmc.org

b. Type password when connected

2. Once connected to the cluster, type bsub -Is bash

a. Type “module load mgltools[tab complete]”

b. Type “module load autodock[tab complete]”

c. Type “Adt” to load the program

[Figure 32 labels: Menu Bar, Tool Bar, Dashboard, 3D viewer, Info Bar]

Figure 32. AutoDockTools Interface. The interface for AutoDockTools used in the preparation of receptors and grid boxes prior to running AutoDock4.2.6. This interface can also be used to run low throughput docking experiments.


Running AutoDock Locally

Preparing Protein (Receptor)

1. Open the program “AutoDockTools”

a. This will open a graphical interface

2. File → Read Molecule → Molecule of interest → Open

a. At this time, the name of your protein should open under “Current Session”

i. In this example we will use A1_2VOH_A

3. Using the inverted triangle under “Cl” in the dashboard, you can color the atoms of your molecule by atom type

4. Remove water molecules from file:

a. Click “Select” in the menu bar → click “Select From String”

i. This will open another panel

1. In this panel, type HOH* in the “Residues:” box and “*” in the “Atoms:” box → click “Add” → click “Dismiss”

ii. Click “Edit” in the Menu Bar → Select the “Delete” dropdown → Click “Delete Selected Atoms”

1. At this point you will get a warning message because the deletion of nodes cannot be undone; accept this warning by clicking “Continue”

5. Add Hydrogens to your structure file:

a. Click “Edit” in the Menu Bar → Select the “Hydrogens” dropdown → Click “Add”

i. This will open another panel

1. Select “All Hydrogens”, “noBondOrder” and “yes”, then click “OK”
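The water-removal step above can also be done outside the graphical interface before the file is opened. The following minimal Python sketch drops ATOM/HETATM records whose residue name is HOH from PDB-formatted text. It is an illustrative alternative rather than part of the AutoDockTools workflow, and the function name and example records are hypothetical.

```python
def strip_waters(pdb_text, water_names=("HOH",)):
    """Return PDB text with water ATOM/HETATM records removed."""
    kept = []
    for line in pdb_text.splitlines():
        is_coordinate_record = line.startswith(("ATOM", "HETATM"))
        # In PDB coordinate records, columns 18-20 (1-based) hold the residue name.
        if is_coordinate_record and line[17:20].strip() in water_names:
            continue
        kept.append(line)
    return "\n".join(kept) + "\n"


# Hypothetical records, for illustration only.
example = (
    "ATOM      1  N   MET A   1      11.104  13.207   9.447  1.00 20.00           N\n"
    "HETATM 1234  O   HOH A 301      10.000  20.000  30.000  1.00 20.00           O\n"
    "END\n"
)
cleaned = strip_waters(example)
print(cleaned, end="")
```

Hydrogens still need to be added afterwards, as described in step 5.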

Preparing Ligand

1. Click “Ligand” on the Brown Bar under the Tool Bar → Select the “Input” Dropdown → Click “Open” and select your ligand file

a. This file may need to be downloaded in Mol2, PDB or PDBQT format

i. The ZINC database has a large number of these ligands in Mol2 format

b. Acknowledge the summary by clicking “OK”

2. Click the “Ligand” dropdown again → Select the “Torsion Tree” dropdown and select “Detect Root…”

a. Click the “Ligand” dropdown again → Select the “Torsion Tree” dropdown and select “Choose Torsions”

i. This will tell you which bonds are considered rotatable. AutoDock can handle up to 32 active bonds, but the fewer there are, the faster the program will run

1. This value can be decreased by clicking the “Ligand” dropdown again → selecting the “Torsion Tree” dropdown and selecting “Set Number of Torsions…”

3. Save your prepared ligand as a PDBQT file:

a. Click the “Ligand” dropdown again → Select the “Output” dropdown and select “Save as PDBQT”


Finish Preparing the Macromolecule

1. Click “Grid” on the Brown Bar under the Tool Bar → Select the “Macromolecule” Dropdown → Click “Choose…” and select your receptor

a. Save the file as a PDBQT file to your folder

Defining the Grid Box

1. Click “Grid” on the Brown Bar under the Tool Bar → Click on “Grid Box…”

a. Set the values to be centered on your pocket of interest in the “Center Grid Box:” boxes

i. An easy way to identify the location of this pocket is to look in the PDB file of your receptor for coordinates of residues in the area of the pocket

ii. If you are unsure of where your protein has a binding pocket, you can use SPPIDER to identify known or predicted protein-protein interaction sites

iii. These parameters can be negative as well

b. Define your search space size by altering the number of points in each dimension

i. A visual representation of this will be drawn in your 3D viewer

ii. Keep in mind the more points present in this search space, the longer the program will run

c. When your box is in your binding pocket, click “File” followed by “Close saving current”
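The suggestion above to read pocket coordinates from the receptor PDB file can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function names are hypothetical, the input must be a standard fixed-column PDB file, and the 0.375 Å spacing is assumed here as the usual AutoGrid default, which should be checked against the value shown in the Grid Box panel.

```python
def pocket_center(pdb_text, residue_numbers, chain=None):
    """Average x, y, z of all atoms in the given residues (PDB fixed columns)."""
    xs, ys, zs = [], [], []
    for line in pdb_text.splitlines():
        if not line.startswith(("ATOM", "HETATM")) or len(line) < 54:
            continue
        if chain is not None and line[21] != chain:  # column 22: chain ID
            continue
        if int(line[22:26]) not in residue_numbers:  # columns 23-26: residue number
            continue
        xs.append(float(line[30:38]))  # columns 31-38: x
        ys.append(float(line[38:46]))  # columns 39-46: y
        zs.append(float(line[46:54]))  # columns 47-54: z
    if not xs:
        raise ValueError("no atoms matched the requested residues")
    n = len(xs)
    return (sum(xs) / n, sum(ys) / n, sum(zs) / n)


def box_edge_angstrom(n_points, spacing=0.375):
    """Edge length of the search box for a given number of points per dimension."""
    return n_points * spacing


# Usage sketch (file name and residue numbers are placeholders):
# with open("receptor.pdb") as handle:
#     print(pocket_center(handle.read(), {52, 60}, chain="A"))
print(box_edge_angstrom(40))
```

The resulting coordinates can be typed into the “Center Grid Box:” fields, and the edge length gives a quick sense of whether the chosen number of points covers the pocket.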


Prepare AutoGrid

1. Click “Grid” on the Brown Bar under the Tool Bar → Select the “Set Map Types” Dropdown → Click “Choose Ligand”

a. Click on the prepared ligand file and select ligand

2. Optionally, flexible receptor residues can be added by Grid → Set Map Types → Choose FlexRes

3. Set AutoGrid parameter file:

a. Click “Grid” on the Brown Bar under the Tool Bar → Select the “Output” Dropdown → Click “Save GPF…” and name the file in the .gpf format, then click “Save”

Running AutoGrid

1. First, make sure that all of your files thus far are in the same folder. You should have autodock4 and autogrid4, along with your pdbqt files for both your receptor and ligand. You should also have a .gpf file for your grid.

a. Once you see all of these files in the same folder, set this folder as your default working folder by clicking “File” → “Preferences” → “Set”, which will open a new tab

i. Under startup directory, select the folder your files are all in and click “Make Default” and “Set” followed by “Dismiss”


Figure 33. AutoDockTools User Preference Tab. In this tab, the user must specify the startup directory as the working directory that includes all files related to AutoDock.


2. Once the files are all in the same place and your default startup directory is selected, you are ready to run AutoGrid4

a. Click “Run” in the brown bar, followed by “Run AutoGrid”. This will open a new window in which you need to make sure the working directory is the startup directory you just set in step 1. After this, under “Program Pathname:”, browse for the executable file “autogrid4”

b. Next, browse for your .gpf file prepared in previous steps in the “Parameter Filename” box. Once selected, this should automatically fill in the “Log Filename” box to produce a .glg file.

c. Once both paths are selected, click the launch button. This should take a minute or two and produce a number of files in the same directory. To double check if the program successfully executed, you can look in the .glg file produced. The last line of this file should read something like:

“C:/Users/Alex/Desktop/LabStuff/Bcl2Family/AutoDock4.2.6/autogrid4.exe:

Successful Completion.

Real= 20.22s, CPU= 19.97s, System= 0.09s”
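When many grids or dockings are run, checking each log by eye becomes tedious. The sketch below looks for the “Successful Completion” text near the end of a log file, based only on the example output quoted above. The function name is hypothetical, and the marker string and the number of trailing lines inspected are assumptions that may need adjusting for other program versions.

```python
from pathlib import Path


def finished_successfully(log_path, marker="Successful Completion", tail_lines=5):
    """Return True if the marker text appears in the last few lines of the log."""
    lines = Path(log_path).read_text(errors="replace").splitlines()
    return any(marker in line for line in lines[-tail_lines:])


# Usage sketch (file name is a placeholder):
# if not finished_successfully("receptor.glg"):
#     print("AutoGrid did not report successful completion")
```

The same check applies to the .dlg file written by AutoDock, since its closing lines use the same wording.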

Preparing AutoDock Parameter File and Running AutoDock

1. Specify gridmap filename stem:

a. Click “Docking” → Select the “Macromolecule” dropdown menu → Click “Set Rigid Filename…” → Select your receptor pdbqt file and click “Open”

2. Specify your ligand of interest:


a. Click “Docking” → Select the “Ligand” dropdown menu → Click “Choose…” → Select your ligand of interest and click “Select Ligand” followed by “Accept”

3. Set your search method:

a. Click “Docking” → Select the “SearchParameters” dropdown menu → Click “Genetic Algorithm…” → Here you can change the number of evals per run. Harder problems should be allowed a larger number of evals. → Click “Accept”

4. Set your genetic algorithm:

a. Click “Docking” → Select the “Output” dropdown menu → Click “Lamarckian GA…” → Select your ligand of interest and click “Save”

5. Run the program:

a. Click “Run” → Click “Run AutoDock…”

i. This will open a new window. Make sure your working directory is the default set before running autogrid4

ii. Browse in the “Program Path” box to find autodock4.exe

iii. Browse in the “Parameter Filename” box and find your .dpf file named for the ligand of interest. This should automatically generate a “Log Filename” in the box below.

iv. Click “Launch”. This should take a few minutes and will be complete when the .dlg file has a final line that looks something like this:

“C:/Users/Alex/Desktop/LabStuff/Bcl2Family/AutoDock4.2.6/autodock4.exe: Successful Completion on "ALEX-LAPTOP"

Real= 11m 07.40s, CPU= 3m 55.27s, System= 0.17s”

6. Visualizing the docking:

a. Click “Analyze” → Select the “Dockings” dropdown menu → Click “Open…” → Select your .dlg file and click “Open”

b. Click “Analyze” → Select the “Conformations” dropdown and click “Load”

c. Click “Analyze” → Select the “Conformations” dropdown menu → Click “Play, ranked by energy…”

d. Click “Analyze” → Select the “Macromolecules” dropdown menu → Click “Open…”

i. Select your receptor.pdbqt file and click “Open”

ii. At this point, you can display your receptor as a molecular surface by clicking the circle under MS in the dashboard (in line with your protein)

e. An isocontour of your binding site within your gridbox can be displayed by clicking “Analyze” → “Grids” → “Open…” and then opening your receptor.OA.map file

i. Once opened, set Sampling to 1

ii. You can also map residues within the pocket as sticks and balls by clicking “Select” → “Select From String” → “Residues:” and adding your known residues

1. After adding your residues click “Display” → “Sticks and Balls” → “Ok”

f. To see the center of each docked result (for clustering of docking) click “Analyze” → “Dockings” → “Show as Spheres…”


Running AutoDock on the cluster (high throughput)

A special thanks to Jacek Biesiada for adapting the AutoDock scripts for high-throughput screening on the CCHMC cluster.

1. Connect to the cluster using Citrix Receiver

a. connect.research.cchmc.org as the HTML

b. Type password when connected

2. Once connected to the cluster, type “bsub -Is -M (memory needed) -W (walltime needed) bash” in the terminal to reserve time on a single node

a. Type “module load mgltools[tab complete]”

i. The tab complete will help identify the most recent version available on

the cluster

b. Type “module load autodock[tab complete]”

i. The tab complete will help identify the most recent version available on

the cluster

c. Type “adt” to load the AutoDockTools program

3. There should be a file in your docking directory titled “adscr.cfg”. This is the autodock

screen configuration file. It has all data paths listed within it and needs to be modified to

the location of your directories.

4. prepare_ligand.pl (Only if your ligands are not already part of a cleaned up library)

a. Uses ligand name and autodock version as inputs

b. Make sure the path is correct; there should be two lines that read:

$pathsh="/usr/local/MGLTools/MGLTools-1.5.6/bin";

$pathsk="/usr/local/MGLTools/MGLTools-1.5.6/MGLToolsPckgs/AutoDockTools/Utilities24";

c. Then type “prepare_ligand.pl -l (ligand name) -v 4”

5. prepare_receptor.pl (If the receptor was not already prepared using AutoDockTools)

a. Uses receptor name and autodock version as inputs

b. Type “prepare_receptor.pl -r (receptor name) -v 4”

6. prepare_gpf.pl (If the Grid Parameter File was not already prepared using

AutoDockTools)

a. Uses receptor, ligand file and autodock version as inputs

b. Make sure the path is correct; there should be two lines that read:

$pathsh="/usr/local/MGLTools/MGLTools-1.5.6/bin";

$pathsk="/usr/local/MGLTools/MGLTools-1.5.6/MGLToolsPckgs/AutoDockTools/Utilities24";

c. Type “prepare_gpf.pl -r (receptor name) -l (ligand name) -v 4”

7. prepare_dpf.pl

a. Uses ligand, receptor and autodock version as inputs

b. Make sure the path is correct; there should be two lines that read:

$pathsh="/usr/local/MGLTools/MGLTools-1.5.6/bin";

$pathsk="/usr/local/MGLTools/MGLTools-1.5.6/MGLToolsPckgs/AutoDockTools/Utilities24";

c. Type “prepare_dpf.pl -r (receptor name) -l (ligand name) -v 4”


[Flow chart. Node labels: library; Protein.cfg; # of nodes; Queue_2VOH_A1.sh; Queue.pl; Adscr.cfg; Run_library_sublibrary.pl; Adscr.pl; structural model; num_evals; sublibraries generated; Ad_lib; .dlg files (if specified); .err; .out; bzcat; Sublib_allki; Sublib_allbinding; Sublib_cl; Bestki_median.pl; estimated ΔG; estimated Ki; index of top X ligands; increased number of evals for next run]

Figure 34. Flow chart representation of running AutoDock in an iterative fashion on a high-throughput screening library. Any green ovals are needed as inputs for AutoDock while those in light blue are potential outputs. The red triangles are logs of any potential errors that may arise and in orange is a value that will be an output of early iterations and an input of later iterations.
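Two of the outputs in this workflow, the estimated ΔG and the estimated Ki, are linked by the standard thermodynamic relation ΔG = RT ln(Ki). The sketch below converts between the two so that Ki values from different runs can be compared on an energy scale. It assumes a temperature of 298.15 K and energies in kcal/mol; the function names are illustrative and are not taken from the AutoDock scripts.

```python
import math

GAS_CONSTANT_KCAL = 1.9872e-3  # gas constant in kcal/(mol*K)
TEMPERATURE_K = 298.15         # assumed temperature


def ki_from_delta_g(delta_g_kcal_per_mol: float,
                    temperature_k: float = TEMPERATURE_K) -> float:
    """Convert a binding free energy (kcal/mol) into an inhibition constant (M)."""
    return math.exp(delta_g_kcal_per_mol / (GAS_CONSTANT_KCAL * temperature_k))


def delta_g_from_ki(ki_molar: float,
                    temperature_k: float = TEMPERATURE_K) -> float:
    """Convert an inhibition constant (M) into a binding free energy (kcal/mol)."""
    return GAS_CONSTANT_KCAL * temperature_k * math.log(ki_molar)


ki = ki_from_delta_g(-8.0)
print(f"Ki for -8.0 kcal/mol: {ki:.2e} M")
print(f"Round trip: {delta_g_from_ki(ki):.2f} kcal/mol")
```

At this temperature, a change of roughly 1.36 kcal/mol in ΔG corresponds to a tenfold change in Ki, which is a useful rule of thumb when reading the ranked output.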


8. queue.pl (This is the primary script to run)

a. needs number of nodes, proteincfg and hash_ligand files as inputs

b. Make sure your proteincfg file is formatted to your protein of interest

i. This is where you can define your gridbox dimensions

c. Edit the queue.sh file so that it uses the number of nodes, protein configuration file and library that you want to run

d. Run the queue_protein.sh shell script to run autodock. This will generate a number of files in your current directory describing how the libraries were broken up, along with tmp files in the tmp directory specified in the adscr.cfg file. Additionally, your result files will be written to the result folder specified by the adscr.cfg file.

e. This will execute the script that will also produce .dlg files if you specified them in the adscr.cfg file. Save .dlg files only for your last run; otherwise you will produce a huge number of files, maxing out your cluster storage space.

9. Determining the best inhibitors:

a. As an output of the adscr.pl script, you will generate a file for each node used, called LIBRARY_node_allki.bz2

b. These compressed files need to be concatenated to generate a single file with all of the Ki values for all inhibitors. To do this, type “bzcat LIBRARY_*_allki.bz2” and redirect the output to a single file

c. Once concatenated, run the bestki_median.pl script using the concatenated file, the number of ga evaluations (default is 50) and the number of inhibitors you wish to use in the next iteration of autodock (X).

i. This script will output an index file that includes the top X inhibitors. Use this file as your library file in the next iteration: change the library file in your queue_protein.sh file to the output of bestki_median.pl which you specify. Also change the number of evaluations you want to run on the next iteration in the adscr.cfg file. This number should increase as you progress in your iterations, giving more accurate results.

ii. bestki_median.pl -f “FILENAME” -g 50 -s “new subset size”

iii. At the end of simulations, this script can be run with the whole subset as your new subset size to get a table with information on individual Kis. Simply copy this whole table into an Excel sheet and split it into separate columns using tab/space delimiters.

10. Saving .dlg files:

a. On your final iteration of autodock, you will want to save your .dlg files for your

final results. To do this, go into the adscr.cfg file and change the “savedlgfiles” to

yes.

b. Then run autodock for the final time, saving all results.

11. Each iteration of AutoDock should increase the search depth while decreasing the total

number of compounds.

a. For previous runs of AutoDock, the full library used 250,000 evaluations.

b. The top 30,000 compounds used 1,000,000 evaluations.

c. The top 3,000 of those compounds used 10,000,000 evaluations.

12. If running an ensemble docking, use information from runs of each receptor to construct

your next iteration.

a. To merge (union) files while removing repeated lines:

i. cat File1 File2 > OutputFile

ii. sort -u OutputFile > OutputFile_sorted

b. Or, to get the intersection of two files:

i. grep -xF -f File1 File2 > OutputFile_intersect
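The ranking and merging described in steps 9 and 12 can be sketched in a few lines. This is an illustration only: the internals of bestki_median.pl are not shown here, so the sketch assumes (from the script name) that ligands are ranked by the median of their estimated Ki values across GA runs, with lower values being better. The ligand names, Ki values and function names are made up for the example.

```python
from statistics import median


def top_by_median_ki(ki_by_ligand: dict, subset_size: int) -> list:
    """Rank ligands by median estimated Ki (lowest first) and keep the top ones."""
    ranked = sorted(ki_by_ligand,
                    key=lambda lig: (median(ki_by_ligand[lig]), lig))
    return ranked[:subset_size]


def union(*index_lists) -> list:
    """Equivalent of cat + sort -u: every ligand that appears in any list."""
    return sorted(set().union(*index_lists))


def intersection(first, *others) -> list:
    """Equivalent of grep -xF -f: only ligands present in every list."""
    return sorted(set(first).intersection(*others))


# Hypothetical Ki values (M) from three GA runs against two receptor models.
receptor_a = {"lig1": [2e-6, 5e-6, 3e-6],
              "lig2": [9e-5, 1e-4, 8e-5],
              "lig3": [4e-7, 6e-7, 9e-4]}
receptor_b = {"lig1": [1e-6, 2e-6, 2e-6],
              "lig2": [3e-7, 5e-7, 4e-7],
              "lig3": [7e-5, 6e-5, 8e-5]}

top_a = top_by_median_ki(receptor_a, 2)
top_b = top_by_median_ki(receptor_b, 2)
print("Union:", union(top_a, top_b))
print("Intersection:", intersection(top_a, top_b))
```

Taking the union keeps any compound that ranks well against at least one receptor model, whereas the intersection keeps only compounds that rank well against all of them; which is appropriate depends on how permissive the next iteration should be.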


Appendix IV: Running Sig2Lead


Sig2Lead is currently available as an R Shiny application through GitHub at https://github.com/thormaaw/Sig2Lead.R. The early Python code that will be used to build the web interface is also available at this location. Prior to use of the application, RStudio and R must be installed (https://www.rstudio.com/products/rstudio/download/). Upon download, all dependencies must also be installed in order to run the program (Figure 35). These dependency libraries include: ChemmineR, gplots, cluster, ggplot2, shiny, httr, jsonlite, DT, ChemmineOB, plyr, dendextend, colorspace, ggforce, rlist, scatterpie, ggrepel, visNetwork, bazar, XML, RCurl, bitops, igraph and plotly. Unfortunately, Sig2Lead relies heavily on the ChemmineOB package, which is unavailable on macOS. For this reason, a web interface will be developed as described in Chapter IV, which will allow access to the program on macOS. To install the Bioconductor packages (ChemmineR and ChemmineOB), type into the console:

source("https://bioconductor.org/biocLite.R")

biocLite("ChemmineR")

biocLite("ChemmineOB")

Sig2Lead is downloaded as a directory which includes two levels: Sig2LeadShiny and lib. Under Sig2LeadShiny, there should be three files: ui, server and LINCSCompounds, along with the lib and similarity search subdirectories. The lib folder includes all scripts necessary for running within the ui and server of the application itself. When downloading, make sure this structure is preserved, as this is the structure the application internally references. Once installed, set your working directory in RStudio to the Sig2LeadShiny directory as described in Figure 36.


Figure 35. Installation of Packages in R Studio. To install packages required for running Sig2Lead, click on the “Packages” tab on the right side of the interface. Under this tab, click “Install” and then type the package for installation. For installation of Bioconductor packages (ChemmineR and ChemmineOB), the biocLite command is required in the console.


Figure 36. Setting Sig2LeadShiny as the working directory in RStudio. To set a working directory, under the “Files” tab, click the “…” button highlighted in red and navigate to the Sig2LeadShiny folder. Then, click the “More” dropdown and select “Set As Working Directory.” Once the working directory is set, select “Run App” highlighted in orange to open the Sig2Lead application.


Figure 37. Sig2Lead landing/search page. On this page, the user inputs target genes of interest and, optionally, compounds in SMILES format for inclusion in the clustering steps of Sig2Lead analysis. After inputting a gene target and optional SMILES compounds, click the “Go!” button and navigate to the “LINCS Compounds” tab to await results. Alternatively, users can upload a user-defined profile that overlaps with the L1000 genes by changing the “Input a Gene” dropdown to “Upload a Signature.” Additionally, the LINCS data can be queried for concordant (inhibitor) or discordant (activator) profiles as compared to the user-defined gene knockdown or user-defined signature.


Once all packages are installed and the working directory is set, the application is ready to run. Open either ui or server in RStudio and select Run App, depicted in Figure 36. Once open, input the gene target of interest into the “Input a Gene” box and optionally include a list of SMILES of compounds to include in the clustering steps. Once these inputs are correct, click “Go!” and then click the “LINCS Compounds” tab (Figure 37). At this point, the application will print each completed step to the RStudio interface. The whole process will take a few minutes and results will appear on the “LINCS Compounds” tab and “Heatmap” tab once complete. The current interface requires the user to restart the application if they wish to alter their run in any way.

Once the compounds have been collected from iLINCS and clustering is complete, the user has a number of options available to them. First, users can download the complete list of LINCS compounds with concordant signatures to a knockdown in their target gene in either the SMILES or SDF format by clicking the respective button on the “LINCS Compounds” tab (Figure 38). Next, users can navigate to the “Heatmap” tab, which displays all compounds hierarchically clustered by Tanimoto distance, including any compounds optionally added on the search page in magenta (Figure 39).
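For readers unfamiliar with the Tanimoto distance used for this clustering, the sketch below shows how it is computed from binary fingerprints. It is not Sig2Lead code (the application is written in R and lists ChemmineR among its dependencies); the fingerprints here are toy sets of “on” bit positions, and treating two empty fingerprints as identical is a convention assumed for the example.

```python
def tanimoto_similarity(bits_a, bits_b) -> float:
    """Shared on-bits divided by the total number of distinct on-bits."""
    a, b = set(bits_a), set(bits_b)
    if not a and not b:
        return 1.0  # assumed convention for two empty fingerprints
    return len(a & b) / len(a | b)


def tanimoto_distance_matrix(fingerprints: dict) -> dict:
    """Pairwise distances (1 - similarity) as a nested dictionary."""
    return {
        m: {n: 1.0 - tanimoto_similarity(fingerprints[m], fingerprints[n])
            for n in fingerprints}
        for m in fingerprints
    }


# Toy fingerprints: each set lists the positions of the bits that are set.
fingerprints = {
    "cmpdA": {1, 2, 3, 4},
    "cmpdB": {3, 4, 5, 6},
    "cmpdC": {1, 2, 3, 4, 5},
}
distances = tanimoto_distance_matrix(fingerprints)
for name, row in distances.items():
    print(name, {other: round(d, 2) for other, d in row.items()})
```

A distance matrix of this form is the input to the hierarchical clustering shown on the heatmap: compounds with small pairwise distances share most of their substructure bits and end up in the same branch of the dendrogram.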


Figure 38. LINCS Compounds Tab for display of all LINCS compounds with concordant expression signatures to user-defined target gene. This tab displays the output of an iLINCS query, along with the SMILES code for all LINCS compounds with concordant signatures to the target gene. These compounds can be downloaded in either SMILES or SDF format, allowing the user to have structural information about all relevant compounds. Data for LSM compounds are available by clicking the hyperlinked LSM IDs, redirecting the user to the LINCS data portal entry for the compound of interest.


Figure 39. Heatmap tab of Sig2Lead for viewing chemical similarity of LINCS small molecules with concordance to a knockdown of a user-defined target gene along with any user defined small molecules (magenta). This tab allows users to visualize a heatmap of hierarchically clustered small molecules derived from LINCS and user defined compounds if applicable. From this page, representatives and MDS plot can be generated by providing a Tanimoto similarity cutoff and minimum cluster size and clicking “Get MDS Plot.”


Figure 40. MDS Plot tab of Sig2Lead for an alternative view of hierarchically clustered small molecules derived from concordant signatures to easily view added compounds. MDS displays relative Tanimoto distance of each representative compound to one another with pie chart radius and cluster order correlated to cluster size. Blue in pie charts above correspond to compounds with concordant signatures to knockdowns of the target gene in LINCS. Red in the pie charts above are compounds added by the user, showing structural similarity to others within the same cluster. These representatives can be downloaded with the “Download Representatives” button and the whole clusters can be downloaded with the “Download Clusters” button at the bottom of the page. Related NCI compounds can be identified on the “Get Related NCI Compounds” button, which will identify compounds structurally related to each cluster present within the NCI compound library.


Additionally, the user can retrieve an MDS plot and identify cluster centroids as representative compounds by inputting the Tanimoto threshold and minimum cluster size on the “Heatmap” tab and clicking “Get MDS Plot”. Clusters are identified by cutting the dendrogram at the Tanimoto similarity specified by the user, and the representative of each cluster is the compound with the minimum distance to the other members of that cluster. Only clusters containing at least as many compounds as specified in the minimum cluster size box (with a minimum of two) will be included in the MDS plot. The list of representatives can be downloaded as a Comma Separated Values (CSV) file. Additionally, the compounds that comprise each cluster can be downloaded as a CSV file from this page (Figure 40).
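The selection of representatives described above can be sketched as follows. This is a simplified illustration rather than the Sig2Lead implementation: it assumes the clusters have already been obtained by cutting the dendrogram, interprets “minimum distance to each other member” as the smallest summed distance (the cluster medoid), and uses made-up compound names and distances.

```python
def cluster_representatives(distance: dict, clusters: dict,
                            min_cluster_size: int = 2) -> dict:
    """Pick one representative per cluster that meets the size threshold.

    distance: nested dict, distance[a][b] is the Tanimoto distance between a and b.
    clusters: dict mapping a cluster id to the list of its member compounds.
    """
    min_size = max(2, min_cluster_size)  # clusters of one are never reported
    representatives = {}
    for cluster_id, members in clusters.items():
        if len(members) < min_size:
            continue
        # Medoid: the member with the smallest total distance to the others
        # (ties are broken alphabetically by compound name).
        representatives[cluster_id] = min(
            members,
            key=lambda m: (sum(distance[m][other] for other in members), m),
        )
    return representatives


# Hypothetical distances between four compounds.
distance = {
    "c1": {"c1": 0.0, "c2": 0.2, "c3": 0.3, "c4": 0.9},
    "c2": {"c1": 0.2, "c2": 0.0, "c3": 0.2, "c4": 0.8},
    "c3": {"c1": 0.3, "c2": 0.2, "c3": 0.0, "c4": 0.85},
    "c4": {"c1": 0.9, "c2": 0.8, "c3": 0.85, "c4": 0.0},
}
clusters = {1: ["c1", "c2", "c3"], 2: ["c4"]}
print(cluster_representatives(distance, clusters))
```

In this toy example the single-member cluster is dropped and the compound closest to the rest of the first cluster is returned as its representative.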

Once the user has plotted the MDS of these clusters, a list of representatives can be downloaded as a starting point in the drug discovery pipeline. These representatives provide structurally distinct compounds that are likely to target various members of the pathway of interest. For users that desire to use Sig2Lead as an initial screening step before high-throughput or virtual screening, all LINCS-derived compounds are downloadable in both SDF and SMILES formats. These files can be further processed to prepare for virtual screening or searched in chemical databases for library derivation.

Furthermore, similar compounds within the NCI database can be identified by clicking the “Get Related NCI Compounds” button at the bottom of the MDS Plot tab and then navigating to the “Similar Compounds” tab (Figure 41). This function allows the user to identify compounds structurally related to each cluster that could be ordered through the NCI for early validation studies. This type of similarity searching will be expanded to include FDA approved drugs and NCI compounds related to all compounds, instead of only microcluster centroids.


Figure 41. The Similar Compounds tab shows NSC compounds with similar chemical structures to the centroids of each cluster. Using this feature, users are able to identify compounds within the NCI library that can be ordered and tested in vitro. This library can be ordered for only the shipping cost, allowing researchers to generate preliminary data for projects without the high costs in ordering compounds.


After clustering, users can switch to the “STITCH Network” tab to visualize the target of interest and its known interactions (Figure 42). This analysis allows users to determine if compounds in a given cluster are already known inhibitors of pathway members. Users can change the connectivity of this network to increase the complexity of the network, thus incorporating more genes known to associate within a given pathway. Additionally, users can query other genes from their initial search to look at other steps within a pathway, or other pathways that are known to drive similar expression profiles.


Figure 42. STITCH Network Interface shows known interactions reported in the literature. Using the “STITCH Network” tab, users can search their gene of interest, along with compound clusters identified via Sig2Lead to see any known interactions reported in the literature. The yellow circle signifies the query gene under the “Gene of Interest” field, and red ovals represent added compounds which are associated with selected gene or others within the same pathway. Blue ovals represent other compounds within the specified cluster that were not known to associate with the network, but are structurally related to others that may be in the network (such as the compounds in red). Gray circles are other genes identified by STITCH that fall in a pathway with the gene of interest. The thickness of the edges of the network shows confidence of a given association with thicker lines representing more confident associations. This map is interactive and users can change the cluster or gene of interest for updating with the same clusters identified within the previous analysis.


References

1. Opap, K. & Mulder, N. Recent advances in predicting gene–disease associations.

F1000Res 6, (2017).

2. Reilly, M. P. et al. Identification of ADAMTS7 as a novel locus for coronary atherosclerosis and association of ABO with myocardial infarction in the presence of coronary atherosclerosis: two genome-wide association studies. Lancet 377, 383–392 (2011).

3. Costa, V., Aprile, M., Esposito, R. & Ciccodicola, A. RNA-Seq and human complex diseases: recent accomplishments and future perspectives. European Journal of Human Genetics

21, 134–142 (2013).

4. Chen, E. Y. et al. Enrichr: interactive and collaborative HTML5 gene list enrichment analysis tool. BMC Bioinformatics 14, 128 (2013).

5. Carrington, E. M. et al. Anti-apoptotic proteins BCL-2, MCL-1 and A1 summate collectively to maintain survival of immune cell populations both in vitro and in vivo. Cell Death

Differ. 24, 878–888 (2017).

6. Mensink, M. et al. Anti-apoptotic A1 is not essential for lymphoma development in Eµ-

Myc mice but helps sustain transplanted Eµ-Myc tumour cells. Cell Death Differ. 25, 795–806

(2018).

7. Tuzlak, S. et al. The BCL-2 pro-survival protein A1 is dispensable for T cell homeostasis on viral infection. Cell Death Differ 24, 523–533 (2017).

149

8. Wang, C.-Y., Guttridge, D. C., Mayo, M. W. & Baldwin, A. S. NF-κB Induces

Expression of the Bcl-2 Homologue A1/Bfl-1 To Preferentially Suppress Chemotherapy-Induced

Apoptosis. Mol Cell Biol 19, 5923–5929 (1999).

9. Czabotar, P. E., Lessene, G., Strasser, A. & Adams, J. M. Control of apoptosis by the

BCL-2 protein family: implications for physiology and therapy. Nat. Rev. Mol. Cell Biol. 15, 49–

63 (2014).

10. Bittker, J. A. et al. Discovery of Inhibitors of Anti-Apoptotic Protein A1. in Probe

Reports from the NIH Molecular Libraries Program (National Center for Biotechnology

Information (US), 2010).

11. Ottina, E., Lyberg, K., Sochalska, M., Villunger, A. & Nilsson, G. P. Knockdown of the

Antiapoptotic Bcl-2 Family Member A1/Bfl-1 Protects Mice from Anaphylaxis. J Immunol 194,

1316–1322 (2015).

12. Kamath, K. S., Vasavada, M. S. & Srivastava, S. Proteomic databases and tools to decipher post-translational modifications. Journal of Proteomics 75, 127–144 (2011).

13. Rhodes, G. Crystallography made crystal clear: A guide for users of macromolecular models. (Elsevier Inc., 2006).

14. Cavalli, A., Salvatella, X., Dobson, C. M. & Vendruscolo, M. Protein structure determination from NMR chemical shifts. PNAS 104, 9615–9620 (2007).

15. Kanelis, V., Forman‐Kay, J. D. & Kay, L. E. Multidimensional NMR Methods for

Protein Structure Determination. IUBMB Life 52, 291–302

150

16. Morris, G. M. et al. AutoDock4 and AutoDockTools4: Automated Docking with

Selective Receptor Flexibility. J Comput Chem 30, 2785–2791 (2009).

17. França, T. C. C. Homology modeling: an important tool for the drug discovery. Journal of Biomolecular Structure and Dynamics 33, 1780–1793 (2015).

18. Kelley, L. A., Mezulis, S., Yates, C. M., Wass, M. N. & Sternberg, M. J. E. The Phyre2 web portal for protein modeling, prediction and analysis. Nature Protocols 10, 845–858 (2015).

19. Waterhouse, A. et al. SWISS-MODEL: homology modelling of protein structures and complexes. Nucleic Acids Res. (2018). doi:10.1093/nar/gky427

20. Zhang, Y. I-TASSER server for protein 3D structure prediction. BMC Bioinformatics 9,

40 (2008).

21. Autin, L., Steen, M., Dahlbäck, B. & Villoutreix, B. O. Proposed structural models of the prothrombinase (FXa–FVa) complex. Proteins: Structure, Function, and Bioinformatics 63,

440–450

22. Song, L. et al. Prediction and assignment of function for a divergent N-succinyl amino acid racemase. Nature Chemical Biology 3, 486–491 (2007).

23. Kozakov, D. et al. The ClusPro web server for protein–protein docking. Nature Protocols

12, 255–278 (2017).

24. Porollo, A. & Meller, J. Prediction-based fingerprints of protein–protein interactions.

Proteins: Structure, Function, and Bioinformatics 66, 630–645

151

25. Vajda, S. & Kozakov, D. Convergence and combination of methods in protein–protein docking. Current Opinion in Structural Biology 19, 164–170 (2009).

26. Huey, R., Morris, G. M., Olson, A. J. & Goodsell, D. S. A semiempirical free energy force field with charge-based desolvation. Journal of 28, 1145–1152

27. Morris, G. M. et al. Automated docking using a Lamarckian genetic algorithm and an empirical binding free energy function. Journal of Computational Chemistry 19, 1639–1662

28. Biesiada, J., Porollo, A. & Meller, J. On Setting Up and Assessing Docking Simulations for Virtual Screening. in Rational Drug Design 1–16 (Humana Press, Totowa, NJ, 2012). doi:10.1007/978-1-62703-008-3_1

29. London, N. et al. Covalent docking of large libraries for the discovery of chemical probes. Nature Chemical Biology 10, 1066–1072 (2014).

30. Franklin, J., Koehl, P., Doniach, S. & Delarue, M. MinActionPath: maximum likelihood trajectory for large-scale structural transitions in a coarse-grained locally harmonic energy landscape. Nucleic Acids Res 35, W477–W482 (2007).

31. Andrusier, N., Mashiach, E., Nussinov, R. & Wolfson, H. J. Principles of Flexible

Protein-Protein Docking. Proteins 73, 271–289 (2008).

32. Gaudreault, F. & Najmanovich, R. J. FlexAID: Revisiting Docking on Non-Native-

Complex Structures. J. Chem. Inf. Model. 55, 1323–1336 (2015).

33. Wang, C., Bradley, P. & Baker, D. Protein–Protein Docking with Backbone Flexibility.

Journal of Molecular Biology 373, 503–519 (2007).

152

34. Eren, A. M. et al. Minimum entropy decomposition: Unsupervised oligotyping for sensitive partitioning of high-throughput marker gene sequences. The ISME Journal 9, 968–979

(2015).

35. SHANNON, C. E. A Mathematical Theory of Communication. 55

36. Haifeng Li, Keshu Zhang & Tao Jiang. Minimum entropy clustering and applications to gene expression analysis. in 136–145 (IEEE, 2004). doi:10.1109/CSB.2004.1332427

37. Porollo, A. & Meller, J. POLYVIEW-MM: web-based platform for animation and analysis of molecular simulations. Nucleic Acids Res 38, W662–W666 (2010).

38. Debouck, C. & Metcalf, B. The Impact of Genomics on Drug Discovery. Annu. Rev.

Pharmacol. Toxicol. 40, 193–208 (2000).

39. Subramanian, A. et al. A Next Generation Connectivity Map: L1000 Platform and the

First 1,000,000 Profiles. Cell 171, 1437-1452.e17 (2017).

40. Koleti, A. et al. Data Portal for the Library of Integrated Network-based Cellular

Signatures (LINCS) program: integrated access to diverse large-scale cellular perturbation response data. Nucleic Acids Res. 46, D558–D566 (2018).

41. Keenan, A. B. et al. The Library of Integrated Network-Based Cellular Signatures NIH

Program: System-Level Cataloging of Human Cells Response to Perturbations. Cell Syst 6, 13–

24 (2018).

42. Bajusz, D., Rácz, A. & Héberger, K. Why is Tanimoto index an appropriate choice for fingerprint-based similarity calculations? Journal of Cheminformatics 7, 20 (2015).

153

43. Mardia, K. V. SOME PROPERTIES OF CLASSICAL MULTI-DIMENSIONAL

SCALING. 9 (1978).

44. Buja, A. et al. Data Visualization With Multidimensional Scaling. Journal of

Computational and Graphical Statistics 17, 444–472 (2007).

45. Shuker, S. B., Hajduk, P. J., Meadows, R. P. & Fesik, S. W. Discovering High-Affinity

Ligands for Proteins: 274, 5 (1996).

46. Chung, S., Parker, J. B., Bianchet, M., Amzel, L. M. & Stivers, J. T. Impact of linker strain and flexibility in the design of a fragment-based inhibitor. Nature Chemical Biology 5,

407–413 (2009).

47. Erlanson, D. A. Introduction to Fragment-Based Drug Discovery. in Fragment-Based

Drug Discovery and X-Ray Crystallography 1–32 (Springer, Berlin, Heidelberg, 2011). doi:10.1007/128_2011_180

48. Katiyar, S. P., Malik, V., Kumari, A., Singh, K. & Sundar, D. Fragment-Based Ligand

Designing. in Computational Drug Discovery and Design 123–144 (Humana Press, New York,

NY, 2018). doi:10.1007/978-1-4939-7756-7_8

49. Bian, Y. & Xie, X.-Q. (Sean). Computational Fragment-Based Drug Design: Current

Trends, Strategies, and Applications. AAPS J 20, 59 (2018).

50. Moore, G. J. Designing peptide mimetics. Trends in Pharmacological Sciences 15, 124–

129 (1994).

154

51. Chung, C. Restoring the switch for cancer cell death: Targeting the apoptosis signaling pathway. American Journal of Health-System Pharmacy ajhp170607 (2018). doi:10.2146/ajhp170607

52. van Delft, M. F. et al. The BH3 mimetic ABT-737 targets selective Bcl-2 proteins and efficiently induces apoptosis via Bak/Bax if Mcl-1 is neutralized. Cancer Cell 10, 389–399

(2006).

53. Oltersdorf, T. et al. An inhibitor of Bcl-2 family proteins induces regression of solid tumours. Nature 435, 677–681 (2005).

54. Berberich, I. & Hildeman, D. A. The Bcl2a1 gene cluster finally knocked out: first clues to understanding the enigmatic role of the Bcl-2 protein A1. Cell Death Differ 24, 572–574

(2017).

55. Labi, V., Erlacher, M., Kiessling, S. & Villunger, A. BH3-only proteins in cell death initiation, malignant disease and anticancer therapy. Cell Death and Differentiation 13, 1325–

1338 (2006).

56. Kim, H., Kim, Y.-N., Kim, H. & Kim, C.-W. Oxidative stress attenuates Fas-mediated apoptosis in Jurkat T cell line through Bfl-1 induction. Oncogene 24, 1252–1261 (2005).

57. Hildeman, D. A. et al. Control of Bcl-2 expression by reactive oxygen species. PNAS

100, 15035–15040 (2003).

58. Youle, R. J. & Strasser, A. The BCL-2 protein family: opposing activities that mediate cell death. Nature Reviews Molecular Cell Biology 9, 47–59 (2008).

155

59. Vogler, M. BCL2A1: the underdog in the BCL2 family. Cell Death Differ. 19, 67–74

(2012).

60. Gonzalez, J., Orlofsky, A. & Prystowsky, M. B. A1 is a growth-permissive antiapoptotic factor mediating postactivation survival in T cells. Blood 101, 2679–2685 (2003).

61. Souers, A. J. et al. ABT-199, a potent and selective BCL-2 inhibitor, achieves antitumor activity while sparing platelets. Nat. Med. 19, 202–208 (2013).

62. Presicce, P. et al. IL-1 signaling mediates intrauterine inflammation and chorio-decidua neutrophil recruitment and activation. JCI Insight 3,

63. Harvey, E. P. et al. Crystal Structures of Anti-apoptotic BFL-1 and Its Complex with a

Covalent Stapled Peptide Inhibitor. Structure 26, 153-160.e4 (2018).

64. Smits, C., Czabotar, P. E., Hinds, M. G. & Day, C. L. Structural Plasticity Underpins

Promiscuous Binding of the Prosurvival Protein A1. Structure 16, 818–829 (2008).

65. Jenson, J. M., Ryan, J. A., Grant, R. A., Letai, A. & Keating, A. E. Epistatic mutations in

PUMA BH3 drive an alternate binding mode to potently and selectively inhibit anti-apoptotic

Bfl-1. eLife Sciences 6, e25541 (2017).

66. Zhai, D., Jin, C., Huang, Z., Satterthwait, A. C. & Reed, J. C. Differential Regulation of

Bax and Bak by Anti-apoptotic Bcl-2 Family Proteins Bcl-B and Mcl-1. J. Biol. Chem. 283,

9580–9586 (2008).

67. Vogler, M. et al. Concurrent up-regulation of BCL-XL and BCL2A1 induces approximately 1000-fold resistance to ABT-737 in chronic lymphocytic leukemia. Blood 113,

4403–4413 (2009).

156

68. Cosconati, S. et al. Virtual Screening with AutoDock: Theory and Practice. Expert Opin

Drug Discov 5, 597–607 (2010).

69. Forli, S. & Olson, A. J. A forcefield with discrete displaceable waters and desolvation entropy for hydrated ligand docking. J Med Chem 55, 623–638 (2012).

70. Cao, Y., Charisi, A., Cheng, L.-C., Jiang, T. & Girke, T. ChemmineR: a compound mining framework for R. Bioinformatics 24, 1733–1734 (2008).

71. Shapiro, A. B., Walkup, G. K. & Keating, T. A. Correction for interference by test samples in high-throughput assays. J Biomol Screen 14, 1008–1016 (2009).

72. O’Boyle, N. M. et al. : An open chemical toolbox. Journal of

Cheminformatics 3, 33 (2011).

73. Owicki, J. C. Fluorescence Polarization and Anisotropy in High Throughput Screening:

Perspectives and Primer. J Biomol Screen 5, 297–306 (2000).

74. Tallarida, R. J. Quantitative Methods for Assessing Drug Synergism. Genes Cancer 2,

1003–1008 (2011).

75. Tulloch, L. B. et al. Direct and indirect approaches to identify drug modes of action.

IUBMB Life 70, 9–22 (2018).

76. Matthews, H., Hanison, J. & Nirmalan, N. “Omics”-Informed Drug and Biomarker

Discovery: Opportunities, Challenges and Future Perspectives. Proteomes 4, (2016).

157

77. Cheng, T., Li, Q., Wang, Y. & Bryant, S. H. Identifying Compound-Target Associations by Combining Bioactivity Profile Similarity Search and Public Databases Mining. J Chem Inf

Model 51, 2440–2448 (2011).

78. Lamb, J. et al. The Connectivity Map: Using Gene-Expression Signatures to Connect

Small Molecules, Genes, and Disease. Science 313, 1929–1935 (2006).

79. Cheng, T., Liu, Z. & Wang, R. A knowledge-guided strategy for improving the accuracy of scoring functions in binding affinity prediction. BMC Bioinformatics 11, 193 (2010).

80. Fliri, A. F., Loging, W. T., Thadeio, P. F. & Volkmann, R. A. Biological spectra analysis:

Linking biological activity profiles to molecular structure. Proc Natl Acad Sci U S A 102, 261–

266 (2005).

81. Horan, K. & Girke, T. ChemmineOB: R interface to a subset of OpenBabel functionalities. (2017).

82. Wickham, H. et al. ggplot2: Create Elegant Data Visualisations Using the Grammar of

Graphics. (2018).

83. Szklarczyk, D. et al. STITCH 5: augmenting protein–chemical interaction networks with tissue and affinity data. Nucleic Acids Research 44, D380–D384 (2016).

84. Mysinger, M. M., Carchia, M., Irwin, J. J. & Shoichet, B. K. Directory of Useful Decoys,

Enhanced (DUD-E): Better Ligands and Decoys for Better Benchmarking. J. Med. Chem. 55,

6582–6594 (2012).

85. Labbé, C. M. et al. MTiOpenScreen: a web server for structure-based virtual screening.

Nucleic Acids Res 43, W448–W454 (2015).

158

86. Dembitsky, V. M., Gloriozova, T. A. & Poroikov, V. V. Pharmacological and Predicted

Activities of Natural Azo Compounds. Nat Prod Bioprospect 7, 151–169 (2017).

87. Yang, N. J. & Hinner, M. J. Getting Across the Cell Membrane: An Overview for Small

Molecules, Peptides, and Proteins. Methods Mol Biol 1266, 29–53 (2015).

88. Bienstock, R. J. Overview: Fragment-Based Drug Design. in Library Design, Search

Methods, and Applications of Fragment-Based Drug Design (ed. Bienstock, R. J.) 1076, 1–26

(American Chemical Society, 2011).

89. van den Heuvel, M. J. et al. The international validation of a fixed-dose procedure as an alternative to the classical LD50 test. Food and Chemical Toxicology 28, 469–482 (1990).

90. Participants, N. R. C. (US) C. on the U. of T. P. T. R. with H. R. Values and Limitations of Animal Toxicity Data. (National Academies Press (US), 2004).

91. Nair, A. B. & Jacob, S. A simple practice guide for dose conversion between animals and human. J Basic Clin Pharm 7, 27–31 (2016).

92. Trott, O. & Olson, A. J. AutoDock Vina: Improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. Journal of Computational Chemistry 31, 455–461 (2010).

93. Grosdidier, A., Zoete, V. & Michielin, O. SwissDock, a protein-small molecule docking web service based on EADock DSS. Nucleic Acids Res 39, W270–W277 (2011).

94. Bohacek, R. S., McMartin, C. & Guida, W. C. The art and practice of structure-based drug design: A molecular modeling perspective. Medicinal Research Reviews 16, 3–50 (1996).

95. Irwin, J. J. & Shoichet, B. K. ZINC – A Free Database of Commercially Available Compounds for Virtual Screening. J Chem Inf Model 45, 177–182 (2005).

96. Levitin, H. M., Yuan, J. & Sims, P. A. Single-Cell Transcriptomic Analysis of Tumor Heterogeneity. Trends in Cancer 4, 264–268 (2018).

97. Wetterstrand, K. DNA Sequencing Costs: Data. National Human Genome Research Institute (NHGRI) Available at: https://www.genome.gov/27541954/dna-sequencing-costs-data/. (Accessed: 2nd August 2018)

98. Cunningham, M. W. Pathogenesis of Group A Streptococcal Infections. Clinical Microbiology Reviews 13, 470–511 (2000).

99. Darouiche, R. O. Treatment of Infections Associated with Surgical Implants. New England Journal of Medicine 350, 1422–1429 (2004).

100. Wu, Y. et al. The Two-Component Signal Transduction System ArlRS Regulates Staphylococcus epidermidis Biofilm Formation in an ica-Dependent Manner. PLoS ONE 7, e40041 (2012).

101. Fournier, B., Klier, A. & Rapoport, G. The two-component system ArlS-ArlR is a regulator of virulence gene expression in Staphylococcus aureus. Molecular Microbiology 41, 247–261 (2001).

102. Sumby, P., Whitney, A. R., Graviss, E. A., DeLeo, F. R. & Musser, J. M. Genome-Wide Analysis of Group A Streptococci Reveals a Mutation That Modulates Global Phenotype and Disease Specificity. PLoS Pathogens 2, e5 (2006).

103. Aziz, R. K. et al. Microevolution of Group A Streptococci In Vivo: Capturing Regulatory Networks Engaged in Sociomicrobiology, Niche Adaptation, and Hypervirulence. PLoS ONE 5, e9798 (2010).

104. Kasper, K. J. et al. Bacterial Superantigens Promote Acute Nasopharyngeal Infection by Streptococcus pyogenes in a Human MHC Class II-Dependent Manner. PLoS Pathogens 10, e1004155 (2014).

105. Sundberg, E. J., Deng, L. & Mariuzza, R. A. TCR recognition of peptide/MHC class II complexes and superantigens. Seminars in Immunology 19, 262–271 (2007).

106. Watanabe-Ohnishi, R. et al. Selective Depletion Of Vβ-Bearing T Cells In Patients With Severe Invasive Group A Streptococcal Infections And Streptococcal Toxic Shock Syndrome. Journal of Infectious Diseases 171, 74–84 (1995).

107. Li, J. et al. Neutrophils Select Hypervirulent CovRS Mutants of M1T1 Group A Streptococcus during Subcutaneous Infection of Mice. Infection and Immunity 82, 1579–1590 (2014).

108. Påhlman, L. I. et al. Soluble M1 protein of Streptococcus pyogenes triggers potent T cell activation. Cellular Microbiology 0, 070928215112001-??? (2007).

109. Levin, J. C. & Wessels, M. R. Identification of csrR/csrS, a genetic locus that regulates hyaluronic acid capsule synthesis in group A Streptococcus. Molecular Microbiology 30, 209–219 (1998).

110. Nelson, D. C., Garbe, J. & Collin, M. Cysteine proteinase SpeB from Streptococcus pyogenes – a potent modifier of immunologically important host and bacterial proteins. Biological Chemistry 392, (2011).

111. Chella Krishnan, K., Mukundan, S., Landero Figueroa, J. A., Caruso, J. A. & Kotb, M. Metal-Mediated Modulation of Streptococcal Cysteine Protease Activity and Its Biological Implications. Infection and Immunity 82, 2992–3001 (2014).

112. Holm, S. E., Norrby, A., Bergholm, A.-M. & Norgren, M. Aspects of Pathogenesis of Serious Group A Streptococcal Infections in Sweden, 1988–1989. Journal of Infectious Diseases 166, 31–37 (1992).

113. Kapur, V. et al. Vaccination with streptococcal extracellular cysteine protease (interleukin-1β convertase) protects mice against challenge with heterologous group A streptococci. Microbial Pathogenesis 16, 443–450 (1994).

114. Misiakos, E. P. et al. Current Concepts in the Management of Necrotizing Fasciitis. Frontiers in Surgery 1, (2014).

115. Watanabe, A. et al. Corrigendum to “Nationwide surveillance of bacterial respiratory pathogens conducted by the Surveillance Committee of Japanese Society of Chemotherapy, Japanese Association for Infectious Diseases, and Japanese Society for Clinical Microbiology in 2009: General view of the pathogens’ antibacterial susceptibility” [J Infect Chemother 18 (2012) 609–620]. Journal of Infection and Chemotherapy 21, 78–80 (2015).

116. Kloos, W. E. Natural Populations of the Genus Staphylococcus. Annual Review of Microbiology 34, 559–592 (1980).

117. Iwase, T. et al. Staphylococcus epidermidis Esp inhibits Staphylococcus aureus biofilm formation and nasal colonization. Nature 465, 346–349 (2010).

118. Rupp, M. E. & Archer, G. L. Coagulase-Negative Staphylococci: Pathogens Associated with Medical Progress. Clinical Infectious Diseases 19, 231–245 (1994).

119. Wisplinghoff, H. et al. Nosocomial bloodstream infections in pediatric patients in United States hospitals: epidemiology, clinical features and susceptibilities. The Pediatric Infectious Disease Journal 22, 686–691 (2003).

120. Jukes, L. et al. Rapid differentiation of Staphylococcus aureus, Staphylococcus epidermidis and other coagulase-negative staphylococci and meticillin susceptibility testing directly from growth-positive blood cultures by multiplex real-time PCR. Journal of Medical Microbiology 59, 1456–1461 (2010).

121. Murdoch, D. R. Clinical Presentation, Etiology, and Outcome of Infective Endocarditis in the 21st Century. Archives of Internal Medicine 169, 463 (2009).

122. Phillips, J. E., Crane, T. P., Noy, M., Elliott, T. S. J. & Grimer, R. J. The incidence of deep prosthetic infections in a specialist orthopaedic hospital. The Journal of Bone and Joint Surgery. British volume 88-B, 943–948 (2006).

123. Nickinson, R. S. J., Board, T. N., Gambhir, A. K., Porter, M. L. & Kay, P. R. The microbiology of the infected knee arthroplasty. International Orthopaedics 34, 505–510 (2009).

124. Rohde, H. et al. Polysaccharide intercellular adhesin or protein factors in biofilm accumulation of Staphylococcus epidermidis and Staphylococcus aureus isolated from prosthetic hip and knee joint infections. Biomaterials 28, 1711–1720 (2007).

125. Peters, G., Locci, R. & Pulverer, G. Adherence and Growth of Coagulase-Negative Staphylococci on Surfaces of Intravenous Catheters. Journal of Infectious Diseases 146, 479–482 (1982).

126. Jämsen, E. et al. Outcome of prosthesis exchange for infected knee arthroplasty: the effect of treatment approach. Acta Orthopaedica 80, 67–77 (2009).

127. Büttner, H., Mack, D. & Rohde, H. Structural basis of Staphylococcus epidermidis biofilm formation: mechanisms and molecular interactions. Frontiers in Cellular and Infection Microbiology 5, (2015).

128. Mack, D., Davies, A. P., Harris, L. G., Knobloch, J. K. M. & Rohde, H. Staphylococcus epidermidis Biofilms: Functional Molecules, Relation to Virulence, and Vaccine Potential. Glycoscience and Microbial Adhesion 157–182 (2008). doi:10.1007/128_2008_19

129. Heilmann, C., Hussain, M., Peters, G. & Götz, F. Evidence for autolysin-mediated primary attachment of Staphylococcus epidermidis to a polystyrene surface. Molecular Microbiology 24, 1013–1024 (1997).

130. Zoll, S. et al. Structural Basis of Cell Wall Cleavage by a Staphylococcal Autolysin. PLoS Pathogens 6, e1000807 (2010).

131. Otto, M. Physical stress and bacterial colonization. FEMS Microbiology Reviews 38, 1250–1270 (2014).

132. Patti, J. M., Allen, B. L., McGavin, M. J. & Hook, M. MSCRAMM-Mediated Adherence of Microorganisms to Host Tissues. Annual Review of Microbiology 48, 585–617 (1994).

133. McCrea, K. W. et al. The serine-aspartate repeat (Sdr) protein family in Staphylococcus epidermidis. Microbiology 146, 1535–1546 (2000).

134. Biswas, R. et al. Activity of the major staphylococcal autolysin Atl. FEMS Microbiology Letters 259, 260–268 (2006).

135. Qin, Z. et al. Role of autolysin-mediated DNA release in biofilm formation of Staphylococcus epidermidis. Microbiology 153, 2083–2092 (2007).

136. Moormeier, D. E., Bose, J. L., Horswill, A. R. & Bayles, K. W. Temporal and Stochastic Control of Staphylococcus aureus Biofilm Development. mBio 5, e01341-14 (2014).

137. Costerton, J. W., Lewandowski, Z., Caldwell, D. E., Korber, D. R. & Lappin-Scott, H. M. Microbial Biofilms. Annual Review of Microbiology 49, 711–745 (1995).

138. Mack, D. et al. The intercellular adhesin involved in biofilm accumulation of Staphylococcus epidermidis is a linear beta-1,6-linked glucosaminoglycan: purification and structural analysis. Journal of Bacteriology 178, 175–183 (1996).

139. Gerke, C., Kraft, A., Süßmuth, R., Schweitzer, O. & Götz, F. Characterization of the N-Acetylglucosaminyltransferase Activity Involved in the Biosynthesis of the Staphylococcus epidermidis Polysaccharide Intercellular Adhesin. Journal of Biological Chemistry 273, 18586–18593 (1998).

140. Vuong, C. et al. A Crucial Role for Exopolysaccharide Modification in Bacterial Biofilm Formation, Immune Evasion, and Virulence. Journal of Biological Chemistry 279, 54881–54886 (2004).

141. O’Gara, J. P. ica and beyond: biofilm mechanisms and regulation in Staphylococcus epidermidis and Staphylococcus aureus. FEMS Microbiology Letters 270, 179–188 (2007).

142. Tormo, M. A. et al. SarA Is an Essential Positive Regulator of Staphylococcus epidermidis Biofilm Development. Journal of Bacteriology 187, 2348–2356 (2005).

143. Christner, M. et al. The giant extracellular matrix-binding protein of Staphylococcus epidermidis mediates biofilm accumulation and attachment to fibronectin. Molecular Microbiology 75, 187–207 (2010).

144. Conrady, D. G., Wilson, J. J. & Herr, A. B. Structural basis for Zn2+-dependent intercellular adhesion in staphylococcal biofilms. (2013). doi:10.2210/pdb4fun/pdb

145. Schaeffer, C. R. et al. Accumulation-Associated Protein Enhances Staphylococcus epidermidis Biofilm Formation under Dynamic Conditions and Is Required for Infection in a Rat Catheter Model. Infection and Immunity 83, 214–226 (2014).

146. Rohde, H. et al. Induction of Staphylococcus epidermidis biofilm formation via proteolytic processing of the accumulation-associated protein by staphylococcal and host proteases. Molecular Microbiology 55, 1883–1895 (2005).

147. Conlon, B. P. et al. Role for the A Domain of Unprocessed Accumulation-Associated Protein (Aap) in the Attachment Phase of the Staphylococcus epidermidis Biofilm Phenotype. Journal of Bacteriology 196, 4268–4275 (2014).

148. Rohde, H. et al. Polysaccharide intercellular adhesin or protein factors in biofilm accumulation of Staphylococcus epidermidis and Staphylococcus aureus isolated from prosthetic hip and knee joint infections. Biomaterials 28, 1711–1720 (2007).

149. Gruszka, D. T. et al. Staphylococcal biofilm-forming protein has a contiguous rod-like structure. Proceedings of the National Academy of Sciences 109, E1011–E1018 (2012).

150. Speziale, P., Pietrocola, G., Foster, T. J. & Geoghegan, J. A. Protein-based biofilm matrices in Staphylococci. Frontiers in Cellular and Infection Microbiology 4, (2014).

151. Histidine Kinases in Signal Transduction. (2003). doi:10.1016/b978-0-12-372484-7.x5000-0

152. Gryllos, I., Levin, J. C. & Wessels, M. R. The CsrR/CsrS two-component system of group A Streptococcus responds to environmental Mg2+. Proceedings of the National Academy of Sciences 100, 4227–4232 (2003).

153. Tatsuno, I., Okada, R., Zhang, Y., Isaka, M. & Hasegawa, T. Partial loss of CovS function in Streptococcus pyogenes causes severe invasive disease. BMC Research Notes 6, 126 (2013).

154. Marchler-Bauer, A. & Bryant, S. H. CD-Search: protein domain annotations on the fly. Nucleic Acids Research 32, W327–W331 (2004).

155. Churchward, G. The two faces of Janus: virulence gene regulation by CovR/S in group A streptococci. Molecular Microbiology 64, 34–41 (2007).

156. Gusa, A. A., Gao, J., Stringer, V., Churchward, G. & Scott, J. R. Phosphorylation of the Group A Streptococcal CovR Response Regulator Causes Dimerization and Promoter-Specific Recruitment by RNA Polymerase. Journal of Bacteriology 188, 4620–4626 (2006).

157. Dalton, T. L. & Scott, J. R. CovS Inactivates CovR and Is Required for Growth under Conditions of General Stress in Streptococcus pyogenes. Journal of Bacteriology 186, 3928–3937 (2004).

158. Alam, F. M., Turner, C. E., Smith, K., Wiles, S. & Sriskandan, S. Inactivation of the CovR/S Virulence Regulator Impairs Infection in an Improved Murine Model of Streptococcus pyogenes Naso-Pharyngeal Infection. PLoS ONE 8, e61655 (2013).

159. Buchanan, J. T. et al. DNase Expression Allows the Pathogen Group A Streptococcus to Escape Killing in Neutrophil Extracellular Traps. Current Biology 16, 396–400 (2006).

160. Walker, M. J. et al. Disease Manifestations and Pathogenic Mechanisms of Group A Streptococcus. Clinical Microbiology Reviews 27, 264–301 (2014).

161. Uchiyama, S., Andreoni, F., Schuepbach, R. A., Nizet, V. & Zinkernagel, A. S. DNase Sda1 Allows Invasive M1T1 Group A Streptococcus to Prevent TLR9-Dependent Recognition. PLoS Pathogens 8, e1002736 (2012).

162. Fournier, B. & Hooper, D. C. A New Two-Component Regulatory System Involved in Adhesion, Autolysis, and Extracellular Proteolytic Activity of Staphylococcus aureus. Journal of Bacteriology 182, 3955–3964 (2000).

163. Liang, X. et al. Global Regulation of Gene Expression by ArlRS, a Two-Component Signal Transduction Regulatory System of Staphylococcus aureus. Journal of Bacteriology 187, 5486–5492 (2005).

164. Holden, M. T. G. et al. Complete genomes of two clinical Staphylococcus aureus strains: Evidence for the rapid evolution of virulence and drug resistance. Proceedings of the National Academy of Sciences 101, 9786–9791 (2004).

165. Altschul, S. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research 25, 3389–3402 (1997).

166. Altschul, S. F. et al. Protein database searches using compositionally adjusted substitution matrices. FEBS Journal 272, 5101–5109 (2005).

167. Fournier, B., Aras, R. & Hooper, D. C. Expression of the Multidrug Resistance Transporter NorA from Staphylococcus aureus Is Modified by a Two-Component Regulatory System. Journal of Bacteriology 182, 664–671 (2000).

168. Shah, N. et al. Reductive evolution and the loss of PDC/PAS domains from the genus Staphylococcus. BMC Genomics 14, 524 (2013).

169. Ponting, C. P. & Aravind, L. PAS: a multifunctional domain family comes to light. Current Biology 7, R674–R677 (1997).

170. Cheung, J. & Hendrickson, W. A. Crystal Structures of C4-Dicarboxylate Ligand Complexes with Sensor Domains of Histidine Kinases DcuS and DctB. Journal of Biological Chemistry 283, 30256–30265 (2008).

171. Fu, L., Niu, B., Zhu, Z., Wu, S. & Li, W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28, 3150–3152 (2012).

172. Li, W. & Godzik, A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22, 1658–1659 (2006).

173. Binkowski, T. A., Naghibzadeh, S. & Liang, J. CASTp: Computed Atlas of Surface Topography of proteins. Nucleic Acids Res 31, 3352–3355 (2003).

174. Velarde, J. J., Ashbaugh, M. & Wessels, M. R. The Human Antimicrobial Peptide LL-37 Binds Directly to CsrS, a Sensor Histidine Kinase of Group A Streptococcus, to Activate Expression of Virulence Factors. J Biol Chem 289, 36315–36324 (2014).

175. Huey, R., Morris, G. M. & Forli, S. Using AutoDock 4 and AutoDock Vina with AutoDockTools: A Tutorial. 32
