APPLICATION OF MOLECULAR MODELING TO DRUG DISCOVERY AND FUNCTIONAL GENOMICS

A Dissertation Presented By Zhouxi (Josie) Wang To The Department of Chemistry and Chemical Biology in partial fulfillment of the requirements For the degree of Doctor of Philosophy in the field of Chemistry

Northeastern University Boston, Massachusetts May, 2012


© 2012 Zhouxi Wang

ALL RIGHTS RESERVED


APPLICATION OF MOLECULAR MODELING TO DRUG DISCOVERY AND FUNCTIONAL GENOMICS

By

Zhouxi (Josie) Wang

ABSTRACT OF DISSERTATION

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Chemistry in the Department of Chemistry and Chemical Biology in the Graduate School of Northeastern University, Boston, Massachusetts May 2012


ABSTRACT

Molecular modeling can accelerate and guide drug design and contribute to the understanding of the biochemical functions of gene products. This thesis applies molecular modeling to facilitate drug design for human African trypanosomiasis and develops a new modeling technique for biochemical function annotation. A special technique employed in this work is the prediction of the individual amino acids in protein structures that are involved in ligand interactions; these predicted local interaction sites are used for drug discovery and for the prediction of the biochemical function of proteins of unknown function.

Chapter 2 applies molecular modeling to structure-based drug design for human African trypanosomiasis (HAT) at the Aurora kinase-1 target. HAT is a vector-borne disease caused by two subspecies of Trypanosoma brucei, affecting thousands of people every year. This disease is fatal if untreated. Current therapeutic interventions are unsatisfactory, all with limited efficacy or life-threatening side effects. In humans, Aurora kinase is an important target for cancer therapies. Its homologue from the pathogenic Trypanosoma brucei, Aurora kinase-1 (TbAUK1), is a validated target for therapeutic intervention for trypanosomiasis, providing an opportunity to repurpose human Aurora kinase inhibitors for the development of TbAUK1 inhibitors. We conducted comparative modeling of TbAUK1 and docking studies to help design and prioritize inhibitors based on a series of analogs of the pyrrolopyrazole inhibitor danusertib, currently in clinical trials for cancer. The TbAUK1 model has provided further structure-based insights for the design of inhibitor affinity and selectivity. New inhibitors designed using the TbAUK1 homology model showed sub-micromolar inhibition in the T. brucei proliferation assay with 25-fold selectivity over human cells.


Chapter 3 describes the application of molecular modeling techniques to investigate other targets for trypanosomiasis treatment, Trypanosoma brucei phosphodiesterase B1 (TbrPDEB1) and Trypanosoma brucei phosphodiesterase B2 (TbrPDEB2). Homology modeling and docking studies of inhibitors repurposed from human phosphodiesterase 4 (PDE4) inhibitors help to rationalize the structure-activity relationships of the piclamilast analog series. The comparison of TbrPDEB1, TbrPDEB2 and human PDE4 has provided insight for next-generation ligand design.

Chapter 4 describes molecular modeling techniques applied to the development of protein function annotation methodology for structural genomics proteins. The Protein Structure Initiative (PSI) has led to significant growth in the number of protein structures. So far, over 11,000 structural genomics (SG) proteins have been deposited in the PDB, and most of these SG proteins are of unknown or uncertain function. To connect the structures of these proteins to their biochemical functions, we developed a computational method that facilitates the classification and identification of protein function using 3D structures as input. A new methodology, Structurally Aligned Local Sites of Activity (SALSA), has been developed for this purpose. This method utilizes two previously developed computational predictors, POOL and THEMATICS. As a proof of concept, the proteins in the concanavalin A-like lectins/glucanases superfamily have been classified according to their biochemical function. The proteins in this superfamily have a similar fold, consisting of a sandwich of 12-14 antiparallel beta strands in two curved sheets. Based on the computationally predicted active site residues and a local structural alignment, the enzymes in this superfamily have been successfully sorted into six functional subgroups, and information about the function of SG proteins has also been provided. One SG protein has been found to be correctly annotated and four SG proteins are likely to belong to new functional subgroups.


ACKNOWLEDGMENTS

I am deeply indebted to my advisor, Prof. Mary Jo Ondrechen, without whom none of my thesis research would have been possible. She gave me tremendous encouragement and support through the years. She gave me the opportunity to work on a variety of projects and encouraged me to explore ideas and work independently. She has not only provided me with scientific training, but has also guided me to be a happier and stronger person.

I would also like to express deep and sincere gratitude to my co-advisor Prof. Michael Pollastri for his constant support and patient guidance. His creative, engaging and passionate leadership in the research projects has shown me the way to be a good researcher. I also gratefully acknowledge the countless hours he spent on discussion and networking to help various projects progress.

In addition to my advisors, I would like to thank Prof. Zhaohui Sunny Zhou and Prof. Carla Mattos, who have provided valuable input and suggestions to this dissertation. I am honored to have them as my thesis committee members.

I could not have completed this dissertation without the support, help, and great friendship of the members of the Ondrechen lab and the Pollastri lab, including Pengcheng Yin, Ramya Parasuram, Joslynn Lee, Stefan Ochiana, Dr. Srinivas Somarowthu and Dr. Jaeju Ko.

It is a pleasure to thank all the faculty, student body and staff members of the Department of Chemistry and Chemical Biology, Northeastern University. Special thanks to Dr. Billy Wu, Dr. Ed Witten, Richard Pumphrey, Jean Harris, and Sheila Magee Beare for all the help and assistance.


I also want to thank my uncle, Guoliang Wang, my aunt, Yonglian Chen, and my cousin, Dr. Sean Wang, who offered me unconditional support and love in my life. I also want to thank my parents, who gave me freedom and love.

Finally, I give my sincere acknowledgment to the National Science Foundation (MCB-0843603 and MCB-1158176) and the National Institutes of Health (R01AI082577 and R15AI082515) for support of this research.


TABLE OF CONTENTS

ABSTRACT
ACKNOWLEDGMENTS
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES

CHAPTER 1: INTRODUCTION
1.1 Homology modeling
1.2 Docking
1.3 The application of modeling-docking in drug discovery
1.4 Human African Trypanosomiasis (HAT)
1.5 Target repurposing
1.6 Active site prediction
1.7 Thesis Overview
1.8 References

CHAPTER 2: IDENTIFICATION OF NEW DRUGS FOR THE TREATMENT OF HUMAN AFRICAN TRYPANOSOMIASIS BASED ON COMPARATIVE MODELING OF TBAUK1
2.1 Introduction
2.2 Methods
2.3 Results and Discussion
2.4 Conclusions
2.5 References

CHAPTER 3: STRUCTURE-BASED T. BRUCEI DRUG DESIGN USING COMPARATIVE MODELING OF THE PHOSPHODIESTERASE TARGETS TBRPDEB1 AND TBRPDEB2
3.1 Introduction
3.2 Method
3.3 Results and discussion
3.4 Conclusions
3.5 References

CHAPTER 4: STRUCTURALLY ALIGNED LOCAL SITES OF ACTIVITY (SALSAS) FOR FUNCTIONAL CHARACTERIZATION OF STRUCTURES IN THE CONCANAVALIN A-LIKE LECTINS/GLUCANASES SUPERFAMILY
4.1 Introduction
4.2 Methods
4.3 Results and Discussion
4.4 Conclusions
4.5 References

CHAPTER 5: CONCLUSIONS AND FUTURE DIRECTIONS

APPENDIX: DESULFURIZATION OF CYSTEINE-CONTAINING PEPTIDES RESULTING FROM SAMPLE PREPARATION FOR PROTEIN ANALYSIS

CURRICULUM VITAE


LIST OF FIGURES

Figure 1-1. An illustration of the major steps in homology modeling.

Figure 2-1. The structure of the TbAUK1 comparative model.

Figure 2-2. The Ramachandran plot of the TbAUK1 model.

Figure 2-3. (A) An illustration of 4,5,6-tetrahydropyrrolo[3,4-c]pyrazole derivatives (danusertib analogs) and their interactions with the Aurora kinase active site subareas; (B) a list of first-generation danusertib analogs assayed in the experiments.

Figure 2-4. (A) Superimposition of TbAUK1, TbAUK2 and human Aurora A reveals two extra helices on the TbAUK2 model on the activation loop. TbAUK1 is colored blue, TbAUK2 is colored green, and human Aurora A is colored grey. (B) Superimposition of TbAUK1, TbAUK3 and human Aurora A reveals two extra helices in the TbAUK2 model in the activation loop.

Figure 2-5. (A) Superimposition of TbAUK1, TbAUK2 and human Aurora A. The non-conserved residues around the ATP binding pocket are enlarged for TbAUK1, TbAUK2 and human Aurora A (TbAUK1 colored blue, TbAUK2 colored green and human Aurora A colored grey). (B) Superimposition of TbAUK1, TbAUK3 and human Aurora A. The non-conserved residues around the ATP binding pocket are enlarged for TbAUK1, TbAUK3 and human Aurora A (TbAUK1 colored magenta, TbAUK3 colored cyan and human Aurora A colored grey).

Figure 2-6. Docking score for Glide SP plotted against EC50 data from the cell proliferation assay for the compounds in Table 2-2, excluding NEU-238.

Figure 2-7. (A) Glide score vs. EC50 and (B) predicted dG binding energy vs. EC50 for the compounds in the literature.


Figure 2-8. (A) The superimposition of the 22 top-ranked danusertib analogs docked in the TbAUK1 model, suggesting a binding mode similar to that described before. Lys58 is colored magenta. (B) An illustration of the 20 target analogs that have been synthesized and subjected to assays.

Figure 2-9. Comparison of (A) the human Aurora A/danusertib complex (PDB ID: 2j50, danusertib colored in green); (B) the predicted conformation of danusertib (colored in light green) in the human Aurora A in TbAUK1 model. Important ligand contact residues are illustrated as sticks in both views. The core substituents at R groups positioned at the phosphorylated pocket are highlighted with pink circles. (C) A superimposition of TbAUK1 vs. human Aurora A. The important residues in both proteins are shown as sticks.

Figure 2-10. Comparison of the computed electrostatic potential of the binding pocket of (A) human Aurora A (PDB ID: 2j50) and (B) the TbAUK1 model.

Figure 3-1. The Ramachandran plots of the comparative models (A) TbrPDEB1 and (B) TbrPDEB2.

Figure 3-2. The superimposition of TbrPDEB1 (yellow) and TbrPDEB2 (magenta). The predicted active sites of TbrPDEB1 and TbrPDEB2 (with 8% POOL cutoff) are shown as sticks and are almost identical.

Figure 3-3. (A) Piclamilast complexed with human PDE4 in the crystal structure (PDB ID: 1xm4) and (B) predicted pose for piclamilast interacting with the TbrPDEB1 model.

Figure 3-4. Structures of three benchmark PDE compounds: rolipram, roflumilast and piclamilast.

Figure 3-5. The ligand binding pocket of TbrPDEB1 (colored in magenta) structurally aligned with human PDE4 (colored in blue).


Figure 4-1. 3D structure of a representative protein in the concanavalin A-like lectins/glucanases superfamily (endoglucanase from Humicola grisea, PDB ID: 1uu4). The core structure for the proteins in the superfamily consists of two major anti-parallel, curved β-sheets arranged in the form of a sandwich, with a number of loops interconnecting the sheets as well as the strands between them.

Figure 4-2. A schematic diagram of the SALSA method.

Figure 4-3. (A) The general mechanisms of glycoside hydrolases (GH). (B) Inverting mechanism for GH and (C) retaining mechanism for GH.

Figure 4-4. (A) Structural alignment of three xylanases from subgroup GH11 (PDB ID: 1m4w (white), 1h4g (magenta) and 1bcx (cyan)); (B) local alignment of POOL-predicted active site residues for these three proteins; (C) structural alignment of endoglucanases from subgroup GH12 (PDB ID: 1uu4 (white), 1h8v (blue) and 2nlr (magenta)); (D) local alignment of POOL-predicted active site residues for these three proteins.

Figure 4-5. Structural alignment of the predicted active site residues of an SG protein from Mycobacterium smegmatis str. MC2 155 (PDB ID: 3rq0) to the active site residues of five known GH16 proteins. Local alignment of the active site residues of the SG protein confirms that it has been correctly annotated as GH16. Residues from different proteins are color coded according to the legend.

Figure 4-6. POOL-predicted active site residues shown explicitly as sticks for the putative glycoside hydrolases with PDB ID (A) 3osd; (B) 3nmb; (C) 3hbk; and (D) 3h3l. It is also noted that one SG protein from Bacteroides thetaiotaomicron (PDB ID: 3osd) shares some similarity in its active site residues with two other SG proteins (PDB IDs: 3nmb and 3hbk).


Figure 4-7. A GH12 protein (PDB: 1m4w) from Nonomuraea flexuosa compared to an SG protein from Parabacteroides distasonis ATCC 8503.


LIST OF TABLES

Table 2-1. The predicted important residues in the TbAUK1 model using THEMATICS and POOL (with 8% cutoff).

Table 2-2. The structurally aligned, predicted functionally important residues in the models of TbAUK1, TbAUK2, TbAUK3, in human Aurora A (PDB ID: 2bmc), and in mouse Aurora (PDB ID: 3d14).

Table 2-3. The docking score and binding free energy calculated using Prime MM/GBSA for danusertib and its derivative compounds, corresponding to the EC50 in the T. brucei anti-proliferation assay.

Table 2-4. Dose-response experiments on the parallel array of analogs of danusertib.

Table 2-5. The 3D structural alignment of the predicted important active site residues for the TbAUK1 model structure vs. human Aurora A (PDB ID: 2j50). Each row represents a protein structure and each column represents a spatial position in the structural alignment. The boldface letters indicate the residues predicted to be functionally important using the top 8% of the POOL ranking. The aligned residues that differ between TbAUK1 and human Aurora A are marked with an asterisk. 'y' indicates the important ligand contact residues reported in the literature.

Table 3-1. Structural alignment of the predicted functionally important active site residues of TbrPDEB1, TbrPDEB2 and LmjPDEB1 obtained with (A) 8% POOL cutoff; (B) 10% POOL cutoff. The residues annotated as important using POOL calculations are shown in boldface. The differences in residue types between TbrPDEB1 and TbrPDEB2 are highlighted in yellow.

Table 3-2. Summary of the docking and SAR results for a series of piclamilast analogues.

Table 3-3. The structural alignment of the important active site residues for human PDE4, LmjPDEB and TbrPDEBs. Residue differences between TbrPDEBs and human PDE4 are marked with an asterisk (*).

Table 4-1. A list of proteins analyzed in the concanavalin A-like lectins/glucanases superfamily.

Table 4-2. (A) The 3D structural alignment of the predicted local active site residues for proteins in the concanavalin A-like lectins/glucanases superfamily. (B) Normalized match scores of the aligned local active sites for the proteins in (A) obtained with the BLOSUM62 scoring matrix. (C) Normalized match scores of the aligned local active sites for the proteins in (A) obtained with the Chemical Specificity Matrix scoring matrix. (D) A list of the consensus signatures for each subgroup based on (A), presented in the format R-N, where R corresponds to the one-letter code for the consensus signature residue and N stands for the number of the spatial position listed in (A).

Table 4-3. Local active site alignment of one crystal structure of alpha-L-arabinofuranosidase B (GH54, PDB ID: 1wd3) from Aspergillus kawachii and one comparative model of alpha-L-arabinofuranosidase B from Actinoplanes sp. (strain ATCC 31044 / CBS 674.73 / SE50/110).

Table 4-4. (A) Predicted functionally important active site residues from POOL for five well-studied glycoside hydrolases and one SG protein from Mycobacterium smegmatis str. MC2 155 (PDB ID: 3rq0) structurally aligned to each other. (B) The sequence identity of the SG protein and five GH16 proteins; (C) the PM values of the SG protein and five GH16 proteins.


Table 4-5. (A) Normalized match scores of proteins in the concanavalin A-like lectins/glucanases superfamily scored against the consensus signatures for each subgroup using BLOSUM62; (B) sequence identity of proteins in the concanavalin A-like lectins/glucanases superfamily scored against the protein representatives for the functional subgroups.

Table 5-1. Comparison of the computed properties of the current lead molecules, two danusertib analogs that are likely inhibitors of TbAUK1 (NEU325 and NEU327), with those of CNS-active drugs.


CHAPTER 1: INTRODUCTION


Common molecular modeling techniques include structural modeling, molecular mechanics, docking/virtual screening, molecular dynamics simulations, and conformational analysis [7]. These techniques use theoretical and computational methods to model or mimic the behavior of molecules and have been widely applied to understand and predict the behavior of molecular systems [7]. In this thesis, I focus on their application to drug design for neglected tropical diseases and to genomics.

Molecular modeling has become an essential part of contemporary drug discovery. The traditional approach to drug discovery relies on step-wise synthesis and screening of large numbers of compounds to optimize activity profiles; this is extremely time consuming and costly. The cost of these processes has increased significantly in recent years [8], and it takes over a decade for a very small fraction of compounds to pass through the drug discovery pipeline, from initial screening hits or leads through chemical optimization and clinical trials, before reaching the market. Given the need to lower the cost and accelerate the pace of drug discovery, any tool that improves its efficiency is highly desirable. Molecular modeling is one such tool: it can significantly reduce cost, labor, and time and increase the efficiency of the drug discovery process [9].

Molecular modeling is important in genomics as well [10]. Structural genomics (SG) projects are intended to expand the structural knowledge of the protein products encoded by the genome using high-throughput methods. Large-scale protein structure analysis has been applied to whole genomes, e.g. Mycoplasma genitalium [11], Saccharomyces cerevisiae [12], and Escherichia coli [13]. Given resource limitations, it is impossible to determine the three-dimensional structure of every protein by experimental methods. Instead, the plan for SG efforts is to use X-ray crystallography and nuclear magnetic resonance (NMR) techniques to determine at least one structure from each family [10]. The remaining structures may be obtained by molecular modeling techniques, including homology modeling (also called comparative modeling) [14], fold assignment or threading [15], and ab initio protein structure prediction [13, 16]. In addition, molecular modeling techniques are powerful tools that provide insight into protein/gene biological function, guide drug design for target genes [17], and aid protein engineering [18].

Two commonly used molecular modeling techniques, homology modeling and docking, have been widely applied in both drug discovery and structural genomics and are briefly introduced here.

1.1 Homology modeling

Homology modeling, also called comparative modeling, predicts the 3D structure of a protein sequence of unknown structure, the "target." It requires an experimental structure of a sequence homolog that can serve as a template. It is the most detailed and accurate of the current protein structure prediction techniques. The technique rests on the concept that evolutionarily related proteins have similar structures, so the structure of a protein can be constructed from the experimentally determined structures of proteins with similar sequences [7]. The "twilight zone" for homology modeling is the sequence identity range of 25-30% between target protein and template [7]; homology models can be built with reasonable reliability if the sequence identity is higher than this. Threading [15] and ab initio modeling [10] can be applied to obtain 3D protein structures when the sequence identity between target protein and template is below the twilight range or when no template is available for modeling.

As shown in Figure 1-1, the homology modeling process generally consists of four steps: (1) family and/or fold assignment of the target sequence and template selection; (2) target-template alignment; (3) comparative model building; and (4) model assessment. Homology modeling requires that the target sequence be related to at least one known protein structure. Possible templates are found by searching the target sequence against a database of known structures; the available template structures are usually X-ray crystal structures or NMR solution structures. One or more of these structures are selected as the modeling template(s), and the template amino acid sequence is aligned to the target sequence.

Figure 1-1. An illustration of the major steps in homology modeling.


One of the important factors determining the quality and applicability of a homology model is the percentage of sequence identity between the target and the template. In general, a higher sequence identity between target and template means that the resulting model is more reliable and accurate. When the sequence identity between template and target is lower than 30%, the model is less reliable than models built on closer templates, but it can still provide useful information on binding/active sites and on functional annotation using fold assignment [19]. When the sequence identity between template and target is around 30-50%, the models tend to have >85% of the Cα atoms within 3.5 Å of their correct positions. Such homology models are useful tools for supporting experimental design for site-directed mutagenesis, examining changes in binding capacity, and screening virtual libraries for potential inhibitors [20]. When the overall sequence identity between template and target is >60%, the quality of the resulting model is often comparable to that of a standard-resolution crystal structure, and the model is more reliable for ligand docking and drug design [10]. Additionally, bioinformatics tools such as Ramachandran plot analysis [21], Verify3D [22], ANOLEA [23], MolProbity [24] and PROCHECK [25] can help to assess the stereochemical quality of models.
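The identity ranges discussed above can be illustrated with a minimal Python sketch. It assumes that percent identity is computed over the aligned, non-gap columns of a pairwise alignment (one of several conventions in use), and the function names percent_identity and expected_model_use are hypothetical labels introduced here for illustration only. The 50-60% range is not characterized in the text, so the sketch reports it as unspecified rather than guessing.

```python
def percent_identity(aligned_target: str, aligned_template: str) -> float:
    """Percent identity over the non-gap columns of a pairwise alignment.

    Assumption: both strings come from the same alignment, use '-' for gaps,
    and identity is normalized by the number of aligned residue pairs.
    """
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_target.upper(), aligned_template.upper()):
        if a == "-" or b == "-":
            continue  # gap columns are not counted
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no aligned residue pairs to compare")
    return 100.0 * matches / compared


def expected_model_use(identity: float) -> str:
    """Map percent identity onto the ranges described in the text."""
    if not 0.0 <= identity <= 100.0:
        raise ValueError("identity must be between 0 and 100")
    if identity < 30.0:
        return "low: site and function hints via fold assignment"
    if identity <= 50.0:
        return "medium: mutagenesis design and virtual library screening"
    if identity <= 60.0:
        return "unspecified: range not characterized in the text"
    return "high: ligand docking and drug design"


if __name__ == "__main__":
    ident = percent_identity("ACDE-", "ACDFG")
    print(ident, expected_model_use(ident))
```

The thresholds are taken directly from the paragraph above; in practice the usable range also depends on alignment quality and on where the sequence differences fall relative to the binding site.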

1.2 Docking

Docking is a widely applied technique for studying the interaction between ligands and proteins and for virtual screening to find potential hit compounds for protein targets. Traditionally, the docking process was described in terms of a "lock and key": ligands, including small molecules, DNA/RNA, lipids, and even other proteins, serve as the key, while the receptor serves as the lock. Nowadays, the docking process is more often described in terms of a "hand and glove", because ligand binding is commonly viewed as an "induced fit" process in which both proteins and ligands are flexible [26].

1.2.1 Classification of docking programs

The docking process involves explicitly predicting the binding geometry and interactions between ligand and receptor (posing) and estimating the binding affinity of the ligand-receptor complex (scoring). Flexible docking programs can be broadly divided into two classes according to the conformational search algorithm applied: systematic methods (incremental construction, conformational search) and stochastic or random methods (Monte Carlo, genetic algorithms) [27]. Widely used docking programs such as DOCK 6 [28] and FlexX [29] belong to the first class, while GOLD [30] and AutoDock [31] belong to the second. The GLIDE [32] program uses a combination of conformational searching and stochastic algorithms.
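To make the stochastic class concrete, the following Python sketch shows a generic Metropolis Monte Carlo search over a vector of pose parameters. It is a toy illustration under stated assumptions, not the search procedure of any program named above: the function metropolis_search, the step size, the temperature factor kT, and the quadratic toy_score are all hypothetical choices made for this example.

```python
import math
import random


def metropolis_search(score, start, step=0.5, kT=0.6, n_steps=2000, seed=0):
    """Generic Metropolis Monte Carlo minimization of score(pose).

    pose is a list of floats (for example torsion angles or rigid-body
    parameters). Lower scores are treated as better, as in most docking
    scoring functions.
    """
    rng = random.Random(seed)
    current = list(start)
    current_score = score(current)
    best, best_score = list(current), current_score
    for _ in range(n_steps):
        # Propose a random perturbation of every pose parameter.
        trial = [x + rng.uniform(-step, step) for x in current]
        trial_score = score(trial)
        delta = trial_score - current_score
        # Metropolis criterion: always accept improvements, and accept
        # worse poses with probability exp(-delta / kT).
        if delta <= 0 or rng.random() < math.exp(-delta / kT):
            current, current_score = trial, trial_score
            if current_score < best_score:
                best, best_score = list(current), current_score
    return best, best_score


def toy_score(pose):
    """Hypothetical smooth score with its minimum at pose = [1, 1]."""
    return sum((x - 1.0) ** 2 for x in pose)


if __name__ == "__main__":
    pose, value = metropolis_search(toy_score, [5.0, -3.0])
    print(pose, value)
```

Occasionally accepting worse poses is what lets a stochastic search escape local minima, which a purely downhill systematic refinement cannot do; the price is that results depend on the random seed and the number of steps.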

1.2.2 Accuracy of docking

The accuracy of docking is assessed on test cases for which an experimentally determined ligand-protein structure is known. This is done by measuring the root-mean-square deviation (RMSD) of all non-hydrogen ligand atoms between the lowest-energy pose produced by docking and the experimental structure from crystallography. A docked ligand pose with an RMSD of less than 2 Å from the experimental structure is generally considered accurate. The accuracy of widely used docking programs, including DOCK 6 [28], FlexX [29], FRED [33], GLIDE [32], GOLD [30], and SURFLEX [34], has been assessed, and these programs have been shown to generate reliable poses in numerous cases [35]. Studies are underway to improve the robustness of the methods, since all of these programs report, in some cases, false negatives, where an active compound is predicted to be inactive with respect to a target, and false positives, where a weak binding affinity between compound and target is over-predicted [27, 35-36].
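The pose-accuracy criterion described above can be sketched in a few lines of Python. This is a minimal illustration that assumes both poses list the same ligand atoms in the same order and in the same receptor coordinate frame, so no superposition is applied; it also ignores ligand symmetry, which dedicated tools handle. The names heavy_atom_rmsd and is_accurate_pose are hypothetical.

```python
import math


def heavy_atom_rmsd(docked, reference):
    """RMSD in angstroms over non-hydrogen atoms of two ligand poses.

    Each pose is a list of (element, x, y, z) tuples. Assumption: atoms are
    in the same order in both poses and share one coordinate frame.
    """
    if len(docked) != len(reference):
        raise ValueError("poses must contain the same number of atoms")
    squared_sum = 0.0
    n_heavy = 0
    for (el_d, *xyz_d), (el_r, *xyz_r) in zip(docked, reference):
        if el_d != el_r:
            raise ValueError("atom order differs between poses")
        if el_d.upper() == "H":
            continue  # hydrogens are excluded from the comparison
        squared_sum += sum((a - b) ** 2 for a, b in zip(xyz_d, xyz_r))
        n_heavy += 1
    if n_heavy == 0:
        raise ValueError("no non-hydrogen atoms to compare")
    return math.sqrt(squared_sum / n_heavy)


def is_accurate_pose(rmsd, cutoff=2.0):
    """Apply the commonly used 2 angstrom RMSD criterion from the text."""
    return rmsd < cutoff


if __name__ == "__main__":
    crystal = [("C", 0.0, 0.0, 0.0), ("N", 1.5, 0.0, 0.0), ("H", 2.0, 1.0, 0.0)]
    docked = [("C", 1.0, 0.0, 0.0), ("N", 2.5, 0.0, 0.0), ("H", 9.0, 9.0, 9.0)]
    rmsd = heavy_atom_rmsd(docked, crystal)
    print(rmsd, is_accurate_pose(rmsd))
```

In the small example, both heavy atoms are displaced by 1 Å and the misplaced hydrogen is ignored, so the pose passes the 2 Å criterion.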

1.2.3 Scoring functions of docking

While there has been substantial success in predicting the pose of a ligand in a protein, estimating the binding affinity of the ligand-receptor complex is still a challenging task. Scoring functions in docking programs take the ligand-receptor pose as input and provide a ranking or an estimate of the binding affinity of the pose. Currently, scoring functions are roughly categorized into three classes. The first class uses an empirically based potential [37]. These scoring functions require receptor-ligand complexes of known binding affinity and estimate the binding energy as the sum of several energy terms, such as van der Waals potential, electrostatic potential, hydrophobicity and hydrogen bonds. The weights of these energy terms are assigned by regression, fitting the experimentally determined binding affinities of the protein-ligand complexes in the training set. For instance, FlexX [29] uses this class of approach to estimate binding affinity.
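The structure of an empirical scoring function described above can be written compactly. The form below is a generic sketch, not the functional form of FlexX or of any other specific program; the symbols $w_i$ (weights), $f_i$ (interaction terms evaluated on a pose), and the intercept $w_0$ are notation introduced here for illustration.

```latex
% Estimated binding free energy as a weighted sum of interaction terms
\Delta G_{\mathrm{bind}}^{\mathrm{calc}}
  = w_0
  + w_{\mathrm{vdW}}\, f_{\mathrm{vdW}}
  + w_{\mathrm{elec}}\, f_{\mathrm{elec}}
  + w_{\mathrm{hyd}}\, f_{\mathrm{hyd}}
  + w_{\mathrm{hb}}\, f_{\mathrm{hb}}

% Weights fitted by least-squares regression over N training complexes
% with experimentally determined binding free energies
\min_{w}\; \sum_{k=1}^{N}
  \left( \Delta G_{k}^{\mathrm{exp}} - \Delta G_{k}^{\mathrm{calc}}(w) \right)^{2}
```

Because the weights are fitted to a training set, the accuracy of such a function can be expected to depend on how closely a new complex resembles the complexes used for the fit.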

The second class consists of force field-based scoring functions, which use atomic force fields to calculate free energies of binding [31, 38]. This class is similar to the empirical one in that both add the individual contributions from different types of interactions, but the interaction terms in force field-based scoring functions are derived by summing the strengths of the intermolecular interactions between all atoms of the two molecules in the complex, not directly from empirical binding data. These force fields include Assisted Model Building and Energy Refinement (AMBER) [39] and Chemistry at Harvard Molecular Mechanics (CHARMM) [40]. These scoring functions may be further coupled with methods such as free energy perturbation (FEP) [41] and thermodynamic integration (TI) [42] for higher accuracy in binding energy calculations. The docking methods using scoring functions of this class include GOLD [30], AutoDock [31], DOCK 6 [28], and Glide [32].
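The pairwise summation that underlies force field-based scoring can be sketched as follows. This is a simplified illustration with a 12-6 Lennard-Jones term and a Coulomb term only; it is not the AMBER or CHARMM energy function, and the Atom class, the Lorentz-Berthelot combining rules, the constant dielectric, and all parameter values are assumptions chosen for the example.

```python
import math
from dataclasses import dataclass

# Coulomb constant in kcal*angstrom/(mol*e^2); an approximate value, since
# force fields differ slightly in the exact constant they use.
COULOMB_KCAL = 332.0637


@dataclass(frozen=True)
class Atom:
    x: float
    y: float
    z: float
    charge: float   # partial charge in units of e
    sigma: float    # Lennard-Jones diameter in angstroms
    epsilon: float  # Lennard-Jones well depth in kcal/mol


def interaction_energy(ligand, receptor, dielectric=1.0):
    """Sum Lennard-Jones and Coulomb terms over all ligand-receptor pairs."""
    total = 0.0
    for a in ligand:
        for b in receptor:
            r = math.dist((a.x, a.y, a.z), (b.x, b.y, b.z))
            if r == 0.0:
                raise ValueError("overlapping atoms")
            # Lorentz-Berthelot combining rules (an assumption of this sketch)
            sigma = 0.5 * (a.sigma + b.sigma)
            epsilon = math.sqrt(a.epsilon * b.epsilon)
            sr6 = (sigma / r) ** 6
            total += 4.0 * epsilon * (sr6 * sr6 - sr6)
            total += COULOMB_KCAL * a.charge * b.charge / (dielectric * r)
    return total


if __name__ == "__main__":
    r_min = 2.0 ** (1.0 / 6.0) * 3.4  # separation at the Lennard-Jones minimum
    lig = [Atom(0.0, 0.0, 0.0, 0.0, 3.4, 0.1)]
    rec = [Atom(r_min, 0.0, 0.0, 0.0, 3.4, 0.1)]
    print(interaction_energy(lig, rec))  # close to -0.1 kcal/mol
```

The double loop makes the cost proportional to the product of the ligand and receptor atom counts; docking programs commonly reduce this cost by precomputing receptor contributions on a grid.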

The third class consists of knowledge-based scoring functions, which use statistical methods to extract features of atomic interactions from experimentally determined protein-ligand complexes. For example, PMF [43], DrugScore [44] and SMoG [45] apply this class of scoring functions.
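Knowledge-based potentials of this kind are commonly derived with an inverse Boltzmann relation, in which atom-pair distances that occur more often in experimental complexes than in a reference distribution receive favorable (negative) scores. The Python sketch below illustrates that relation only; it is not the implementation of PMF, DrugScore or SMoG, and the function name, the binning, and the choice of reference counts are assumptions made for the example.

```python
import math

KT_KCAL = 0.593  # approximate kT in kcal/mol near 298 K


def inverse_boltzmann(observed_counts, reference_counts, kT=KT_KCAL):
    """Per-bin statistical potential w = -kT * ln(p_observed / p_reference).

    Both arguments are counts per distance bin for one atom-type pair.
    Bins with a zero count in either histogram are returned as None because
    the logarithm is undefined there.
    """
    if len(observed_counts) != len(reference_counts):
        raise ValueError("histograms must have the same number of bins")
    obs_total = sum(observed_counts)
    ref_total = sum(reference_counts)
    if obs_total == 0 or ref_total == 0:
        raise ValueError("histograms must not be empty")
    potentials = []
    for obs, ref in zip(observed_counts, reference_counts):
        if obs == 0 or ref == 0:
            potentials.append(None)
            continue
        ratio = (obs / obs_total) / (ref / ref_total)
        potentials.append(-kT * math.log(ratio))
    return potentials


if __name__ == "__main__":
    # Hypothetical counts: the middle distance bin is over-represented in
    # the observed complexes, so it receives a negative (favorable) score.
    print(inverse_boltzmann([10, 30, 10], [20, 20, 10]))
```

A pose would then be scored by summing the per-bin potentials over all ligand-receptor atom pairs, with the quality of the result depending on the size and diversity of the structural database behind the histograms.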

1.3 The application of modeling-docking in drug discovery

A combination of protein modeling and docking studies (modeling-docking) is one of the most important methodologies in structure-based drug design. It has been widely applied in multiple target-oriented drug discovery projects. For example, homology modeling of Cdc25 phosphatases [46], protein tyrosine phosphatase Shp2 [47], and RSK-2 [48], followed by virtual screening, has been applied to anticancer drug design. Moreover, modeling-docking studies have helped drug design for the 6-phosphofructo-2-kinase (PFKFB3) target for glycolytic flux and tumor growth [49]; lead compound discovery for the CK1δ target for Alzheimer's disease [50]; and drug design for the chemokine receptor CCR5 target for HIV infection [51].

The modeling-docking methodology has also facilitated drug design derived from genomics [10]. For example, it has helped to accelerate hit compound discovery for the severe acute respiratory syndrome (SARS) coronavirus 3C-like proteinase, which is considered a potential drug design target for the treatment of SARS. The overall lead discovery process, from genome through modeling, virtual screening, and compound purchase/synthesis to assay, took only two months. The authors predicted that this remarkably fast timeline could be shortened further with the help of teraflops computing resources [52].


1.4 Human African Trypanosomiasis (HAT)

In this thesis, I have applied molecular modeling to guide drug design for human African trypanosomiasis, a neglected tropical disease. Neglected tropical diseases are diseases that affect impoverished populations in the developing world. Human African trypanosomiasis (HAT), or African sleeping sickness, is a vector-borne disease caused by two subspecies of Trypanosoma brucei: Trypanosoma brucei gambiense in West Africa and Trypanosoma brucei rhodesiense in East Africa. The disease is transmitted by the bite of the tsetse fly. It is lethal if left untreated and can spread rapidly through populations when surveillance and treatment programs are disrupted [53]. The disease progresses through two stages: the parasites first spread in the bloodstream (Stage I) and then invade the central nervous system (Stage II). It is in this later stage that patients begin to show the trademark symptoms of sleeping sickness: lethargy, sleepiness, and coma. This disease affects millions of people per year [54].

Current treatments available for HAT are largely ineffective because of T. brucei resistance and the toxicity of the drugs to patients. For example, melarsoprol, a toxic arsenical agent, itself has a 5% mortality rate. Another drug, eflornithine, has toxic side effects and requires an extensive dosing regimen spread out over 14 days. Although a nifurtimox-eflornithine combination therapy decreases the dosing time and toxicity [55], an acute need persists for safe, inexpensive and convenient therapeutics.

1.5 Target repurposing

In order to accelerate and lower the cost of inhibitor discovery for HAT, the "target repurposing" strategy is applied. Target repurposing exploits the commonly observed property that drug-like chemical species can often bind homologous proteins in addition to the target protein to which they were initially designed to bind. If the homologous protein is a potential drug target for another disease, then this cross-binding can guide repositioning of the discovery program from one disease to another. When a homolog of a human target also appears to be a druggable target in a parasitic pathogen, repositioning the drug-like chemicals developed for the human target can speed up a drug discovery program by drawing on the available data on synthesis, structural biology, structure-activity relationships (SAR), pharmacology, toxicology, and pharmacokinetics of the chemicals already developed against the initial human target.

Target repurposing could seed new drug discovery programs against the parasite targets with low cost and high efficiency when essential enzyme targets in infectious agents are matched with human homologues that have been pursued for other indications. 56 Several examples have illustrated the success of this strategy in HAT drug discovery. For instance, N-myristoyltransferase (NMT) is a protein that plays an important role in post-translational modification. Recently, RNA interference has demonstrated that a homologue of the human protein in trypanosomes is essential for trypanosome growth. This trypanosomal protein shares 55% sequence identity with the human one, providing the opportunity for target repurposing. A high-throughput screen of existing chemical matter against T. brucei NMT retrieved a compound with micromolar inhibition of trypanosome growth. It is noted that compounds from the human NMT program were not reused; rather, the high-throughput screening results from a random library were used. Nevertheless, the structural biology and other aspects of NMT drug discovery have certainly informed work against TbNMT. Further optimization generated compounds with good potency and selectivity over the human NMT. These compounds also have promising pharmaceutical properties, rapidly kill trypanosomes both in vitro and in vivo, and cure trypanosomiasis in mice. 57


A further example is eflornithine, one of two drugs for the treatment of stage II HAT. It is a suicide inhibitor of ornithine decarboxylase, which regulates polyamine biosynthetic pathways that are involved in the generation of small-amine intermediates that are incorporated into nucleic acid and amino acid synthesis. This drug was originally designed for human cancer treatment, but its development was stopped in clinical trials because of poor efficacy against cancer. It has since been repurposed for an anti-trypanosomal indication and has been shown to clear Trypanosoma brucei gambiense infections in humans, though it is not effective against Trypanosoma brucei rhodesiense. 58

Given the success of molecular modeling in previous drug discovery projects, it is expected to be a valuable tool for target repurposing. Currently, one of the main obstacles for HAT drug repurposing is the lack of 3D structures of the parasitic proteins. Homology modeling provides 3D structures of these proteins as a starting point for structure-based inhibitor design. Molecular modeling is also an important tool in providing information about the parasite target binding sites in comparison to the human target; in prioritizing compounds for benchmark screening against the parasite enzymes; and in driving SAR development for potency against the parasite target and for selectivity over the human ones.

1.6 Active site prediction

For applications in drug discovery and in function prediction, it is useful to be able to predict the functionally important amino acid residues in a protein structure. Originally, such predictions were made solely with informatics-based tools. However, new molecular modeling tools can predict catalytic and binding sites in protein structures. These tools are especially important in the post-genome sequencing era. As of the spring of 2011, the structures of over 10,000 Structural Genomics proteins have been deposited in the Protein Data Bank (PDB) 59; most of these structures are of unknown or uncertain function. 60

The most common active site prediction methods are sequence-based, structure-based, or a combination of both. The sequence-based methods rely on sequence comparison or on evolutionary information derived from sequence alignments. These methods assume that the highly conserved regions among sequences in similar proteins from different sources (species/tissues), or even in different proteins with similar functions, are most likely to contain the functionally important active site residues. ConSurf, Evolutionary Trace, 61 and INTREPID62 are examples of sequence-based methods. Evolutionary Trace and INTREPID utilize phylogenetic trees. Methods based only on sequence conservation can transfer functional information of proteins in 30% of cases when two protein sequences have an identity of at least 50%. 63 Even for protein pairs with BLAST E-values below 10^-50, sequence similarity is not sufficient to transfer enzyme functions automatically without errors. 64
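The conservation assumption behind these sequence-based methods can be illustrated with a minimal sketch. It scores each column of a multiple sequence alignment by its Shannon entropy, so that fully conserved columns score 1. This is not the algorithm used by ConSurf, Evolutionary Trace, or INTREPID; the function name `column_conservation`, the entropy-based score, and the toy alignment are all assumptions made for illustration only.

```python
import math
from collections import Counter


def column_conservation(alignment):
    """Return one conservation score in [0, 1] per alignment column.

    Score = 1 - H / log2(20), where H is the Shannon entropy of the residue
    frequencies in the column. Gap characters ('-') are ignored. This simple
    score is an illustrative assumption, not a published method.
    """
    if not alignment:
        raise ValueError("alignment must contain at least one sequence")
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    max_entropy = math.log2(20)  # 20 standard amino acids
    scores = []
    for i in range(length):
        residues = [seq[i] for seq in alignment if seq[i] != "-"]
        if not residues:
            scores.append(0.0)  # column of gaps only: no evidence of conservation
            continue
        total = len(residues)
        counts = Counter(residues)
        entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
        scores.append(1.0 - entropy / max_entropy)
    return scores


# Hypothetical toy alignment (not taken from any real protein family).
toy_alignment = ["MKHDE", "MRHDE", "MKHNE", "MQHDE"]
scores = column_conservation(toy_alignment)
for position, score in enumerate(scores, start=1):
    print(f"column {position}: conservation = {score:.3f}")
```

In this toy case the invariant columns receive the maximum score and would be the candidates flagged as functionally important, which is exactly the inference that the 30% transfer rate quoted above calls into question.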

The structure-based methods extract different structural features for protein active site prediction. For example, Ligsite65, Surfnet66, CASTp67, and ConCavity68 use protein surface topography for the prediction of protein active site residues. PocketFinder and Q-SiteFinder 69 are docking-based methods that use small-molecule, solvent-like probes to identify active sites in proteins. THEMATICS (Theoretical Microscopic Anomalous Titration Curve Shapes) uses electrostatic properties calculated from a protein structure: it computes the theoretical titration curves for all the ionizable residues (Arg, Asp, Cys, Glu, His, Lys and Tyr) in order to reveal active site residues. A typical ionizable residue in a protein is likely to have a classical, sigmoidal titration curve that obeys the Henderson-Hasselbalch equation for a monoprotic acid, whereas active site residues have been identified as exhibiting perturbed titration behavior, and simple statistical criteria have been adopted for the selection of these residues. 70 Recently, a machine learning based method for protein active site residue prediction, called Partial Order Optimum Likelihood (POOL) 71, has been developed. This method calculates a score that is proportional to the probability that a residue is in the active site of a protein. Initially, POOL used the THEMATICS results as input. 71b However, POOL can accept any residue-specific input feature as long as the probability of the functional importance of a residue depends monotonically on that input feature. While THEMATICS predicts only ionizable residues, POOL extends the predictions to incorporate non-ionizable residues and outperforms other methods in the prediction of active site residues. 71
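The contrast between a classical and a perturbed titration curve can be sketched numerically. The first function below is the Henderson-Hasselbalch expression for the protonated fraction of a monoprotic acid. The "perturbed" curve, modeled as a weighted mix of two Henderson-Hasselbalch steps, and the transition-width measure are hypothetical stand-ins chosen only to show what an elongated, non-sigmoidal curve looks like; they are not the THEMATICS calculation or its published statistical criteria, and the pKa values are made up.

```python
import math


def hh_fraction_protonated(ph, pka):
    """Henderson-Hasselbalch protonated fraction of a monoprotic acid."""
    return 1.0 / (1.0 + 10.0 ** (ph - pka))


def perturbed_fraction_protonated(ph, pka_low, pka_high, weight=0.5):
    """Hypothetical perturbed curve: weighted mix of two sigmoidal steps."""
    return (weight * hh_fraction_protonated(ph, pka_low)
            + (1.0 - weight) * hh_fraction_protonated(ph, pka_high))


def transition_width(curve, lo=0.1, hi=0.9, ph_min=-5.0, ph_max=20.0, step=0.001):
    """pH span over which a decreasing curve falls from `hi` to `lo`.

    Illustrative shape measure only (an assumption of this sketch).
    """
    ph_hi = ph_lo = None
    n_steps = int(round((ph_max - ph_min) / step))
    for i in range(n_steps + 1):
        ph = ph_min + i * step
        fraction = curve(ph)
        if ph_hi is None and fraction <= hi:
            ph_hi = ph
        if fraction <= lo:
            ph_lo = ph
            break
    if ph_hi is None or ph_lo is None:
        raise ValueError("curve does not cross both thresholds in the pH range")
    return ph_lo - ph_hi


# Made-up pKa values for illustration.
typical_width = transition_width(lambda ph: hh_fraction_protonated(ph, 6.5))
perturbed_width = transition_width(
    lambda ph: perturbed_fraction_protonated(ph, 5.0, 9.0)
)
print(f"typical residue:   90%-10% transition spans {typical_width:.2f} pH units")
print(f"perturbed residue: 90%-10% transition spans {perturbed_width:.2f} pH units")
```

For an ideal monoprotic acid the 90% to 10% transition always spans 2·log10(9), about 1.9 pH units, regardless of the pKa; a residue whose curve is stretched well beyond that range is the kind of outlier that a shape-based criterion is meant to pick out.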


1.7 Thesis Overview

Thesis Aim

This thesis explores the application of molecular modeling techniques to guide the development of new inhibitors of selected targets for the treatment of a neglected tropical disease, HAT, in target repurposing projects. Homology models of druggable targets have been established for T. brucei Aurora kinase-1 (TbAUK1) and T. brucei cyclic nucleotide phosphodiesterase-B1 (TbrPDEB1). Docking studies using the homology models have provided insight for drug design and for the prioritization of compounds for synthesis.

In addition, molecular modeling techniques have been applied to the development of protein function annotation methodology for structural genomics. A new methodology, Structurally Aligned Local Sites of Activity (SALSA), based on predicted active sites from THEMATICS/POOL, has been developed. The SALSA method is used to sort the members of a superfamily according to their biochemical function and to annotate the biochemical function of the protein structures of unknown function within that superfamily. The enzymes in the Concanavalin A-like lectin/glucanase superfamily are examined in this thesis.

Chapter 1

Introduction

This chapter introduces the molecular modeling techniques, homology modeling and docking, that are used in drug discovery and genomics projects. A survey of their applications in drug discovery, structural genomics and target repurposing for neglected diseases is provided. Additionally, new molecular modeling tools that predict catalytic and binding sites in protein structures are introduced to provide the background information for subsequent chapters.


Chapter 2

Identification of new drugs for the treatment of human African trypanosomiasis based on comparative modeling of TbAUK1

This chapter describes the establishment and validation of a homology model structure for Trypanosoma brucei (T. brucei) Aurora kinase-1 (TbAUK1), a druggable target for trypanosomiasis treatment. Ligand binding residues, computationally predicted by THEMATICS and POOL, together with compound library virtual screening using docking studies, provide computational guidance for TbAUK1 structure-based inhibitor design. Some newly designed ligands are found to be effective in parasite killing in vitro and display good selectivity over host cells.

Chapter 3

Homology modeling for Pharmacological Validation of Trypanosoma brucei Phosphodiesterases B1 and B2 as Druggable Targets for African Sleeping Sickness

This chapter describes the application of molecular modeling techniques to investigate other targets for trypanosomiasis treatment, Trypanosoma brucei phosphodiesterase B1 (TbrPDEB1) and Trypanosoma brucei phosphodiesterase B2 (TbrPDEB2). Homology modeling and docking studies of inhibitors repurposed from human PDE-4 inhibitors help to rationalize the structure-activity relationships for the piclamilast series of analogs. The comparison of TbrPDEB1, TbrPDEB2 and human PDE-4 has provided insight for next-generation ligand design.


Chapter 4

Structurally Aligned Local Sites of Activity (SALSAs) for Functional Characterization of Enzyme Structures in the Concanavalin A-like Lectins/Glucanases Superfamily

This chapter explores new methodology for protein function annotation. The functional residues predicted from POOL are used to define a local functional site in each protein. These local structures are aligned and compared to infer the biochemical function of the protein from Structurally Aligned Local Sites of Activity (SALSAs). The SALSA method is applied to the analysis of the Concanavalin A-like lectin/glucanase superfamily. According to the predicted local active sites, enzymes in this superfamily can be sorted successfully into seven functional subgroups: glycoside hydrolase-16; glycoside hydrolase-11; glycoside hydrolase-12; glycoside hydrolase-7; glycoside hydrolase-54; peptidase A4; and alginate lyase. Structural genomics proteins of previously unknown function in this superfamily are further analyzed using this methodology.

Chapter 5

Future directions for this work are described.

Appendix

Desulfurization of cysteine-containing peptides resulting from sample preparation for protein characterization by mass spectrometry


1.8 References

1. Jetton, N.; Rothberg, K. G.; Hubbard, J. G.; Wise, J.; Li, Y.; Ball, H. L.; Ruben, L., The cell cycle as a therapeutic target against Trypanosoma brucei: Hesperadin inhibits Aurora kinase-1 and blocks mitotic progression in bloodstream forms. Molecular Microbiology 2009, 72 (2), 442-458.
2. Linial, M.; Yona, G., Methodologies for target selection in structural genomics. Prog Biophys Mol Biol 2000, 73, 297-320.
3. Somarowthu, S.; Yang, H. Y.; Hildebrand, D. G. C.; Ondrechen, M. J., High-Performance Prediction of Functional Residues in Proteins with Machine Learning and Computed Input Features. Biopolymers 2011, 95 (6), 390-400.
4. Tong, W.; Wei, Y.; Murga, L. F.; Ondrechen, M. J.; Williams, R. J., Partial Order Optimum Likelihood (POOL): Maximum Likelihood Prediction of Protein Active Site Residues Using 3D Structure and Sequence Properties. PLoS Computational Biology 2009, 5 (1).
5. Ko, J. J.; Murga, L. F.; Andre, P.; Yang, H. Y.; Ondrechen, M. J.; Williams, R. J.; Agunwamba, A.; Budil, D. E., Statistical criteria for the identification of protein active sites using theoretical microscopic titration curves. Proteins 2005, 59 (2), 183-195.
6. Porter, C. T.; Bartlett, G. J.; Thornton, J. M., The Catalytic Site Atlas: a resource of catalytic sites and residues identified in enzymes using structural data. Nucleic Acids Research 2004, 32, D129-D133.
7. Leach, A. R. Molecular Modelling: Principles and Applications 2001.
8. Ooms, F., Molecular modeling and computer aided drug design. Examples of their applications in medicinal chemistry. Curr. Med. Chem. 2000, 7 (2), 141-158.
9. Cavasotto, C. N.; Phatak, S. S., Homology modeling in drug discovery: current trends and applications. Drug Discov. Today 2009, 14 (13-14), 676-683.
10. Sanchez, R.; Pieper, U.; Melo, F.; Eswar, N.; Marti-Renom, M. A.; Madhusudhan, M. S.; Mirkovic, N.; Sali, A., Protein structure modeling for structural genomics. Nature Structural Biology 2000, 7, 986-990.


11. Sanchez, R.; Sali, A., Large-scale protein structure modeling of the Saccharomyces cerevisiae genome. Proceedings of the National Academy of Sciences of the United States of America 1998, 95 (23), 13597-13602.
12. Burley, S.; Almo, S.; Bonanno, J.; Capel, M.; Chance, M.; Gaasterland, T.; Lin, D.; Sali, A.; Studier, F.; Swaminathan, S., Structural genomics: beyond the human genome project. Nature Genet 1999, 23, 151-157.
13. Marsden, R.; Lewis, T.; Orengo, C., Towards a comprehensive structural coverage of completed genomes: a structural genomics viewpoint. BMC Bioinformatics 2007, 8 (1), 86.
14. (a) Weston, G. S.; Sindelar, R. D., Construction of an initial model for human granzyme B from rat mast cell protease 2 using comparative protein homology modeling techniques. Abstracts of Papers American Chemical Society 1993, 206 (1-2), 251; (b) Srinivasan, N.; Blundell, T. L., An evaluation of the performance of an automated procedure for comparative modeling of protein tertiary structure. Protein Engineering 1993, 6 (5), 501-512.
15. (a) Cristobal, S.; Zemla, A.; Fischer, D.; Rychlewski, L.; Elofsson, A., A study of quality measures for protein threading models. BMC Bioinformatics 2001, 2, 5; (b) Kolinski, A.; Betancourt, M. R.; Kihara, D.; Rotkiewicz, P.; Skolnick, J., Generalized comparative modeling (GENECOMP): A combination of sequence comparison, threading, and lattice modeling for protein structure prediction and refinement. Proteins-Structure Function and Genetics 2001, 44 (2), 133-149.
16. (a) Perakyla, M.; Pakkanen, T. A., Ab-initio models for receptor-ligand interactions in proteins. 3. Model assembly study of the proton-transfer in the hydroxylation step of the catalytic mechanism of p-hydroxybenzoate hydroxylase. Journal of the American Chemical Society 1993, 115 (23), 10958-10963; (b) Perakyla, M.; Pakkanen, T. A., Ab initio models for receptor-ligand interactions in proteins. 4. Model assembly study of the catalytic mechanism of triosephosphate isomerase. Proteins-Structure Function and Genetics 1996, 25 (2), 225-236.
17. Rosamond, J.; Allsop, A., Harnessing the power of the genome in the search for new antibiotics. Science (Washington D C) 2000, 287 (5460), 1973-1976.
18. (a) Fleishman, S. J.; Whitehead, T. A.; Ekiert, D. C.; Dreyfus, C.; Corn, J. E.; Strauch, E.-M.; Wilson, I. A.; Baker, D., Computational Design of Proteins Targeting the Conserved Stem Region of Influenza Hemagglutinin. Science 2011, 332 (6031), 816-821; (b) Lassila, J. K.; Baker, D.; Herschlag, D., Origins of catalysis by computationally designed retroaldolase enzymes. Proceedings of the National Academy of Sciences of the United States of America 2010, 107 (11), 4937-4942; (c) Siegel, J. B.; Zanghellini, A.; Lovick, H. M.; Kiss, G.; Lambert, A. R.; Clair, J. L. S.; Gallaher, J. L.; Hilvert, D.; Gelb, M. H.; Stoddard, B. L.; Houk, K. N.; Michael, F. E.; Baker, D., Computational Design of an Enzyme Catalyst for a Stereoselective Bimolecular Diels-Alder Reaction. Science 2010, 329 (5989), 309-313.
19. Sali, A.; Kuriyan, J., Challenges at the frontiers of structural biology (Reprinted from Trends in Biochemical Science, vol 12, Dec., 1999). Trends in Cell Biology 1999, 9 (12), M20-M24.
20. Mosier, P. D.; Kellogg, G. E., Molecular modeling: Considerations for the design of pharmaceuticals and biopharmaceuticals. 2008; p 267-291.
21. Lovell, S. C.; Davis, I. W.; Arendall, W. B.; de Bakker, P. I. W.; Word, J. M.; Prisant, M. G.; Richardson, J. S.; Richardson, D. C., Structure validation by C alpha geometry: phi,psi and C beta deviation. Proteins-Structure Function and Genetics 2003, 50 (3), 437-450.
22. Eisenberg, D.; Luthy, R.; Bowie, J. U., VERIFY3D: Assessment of protein models with three-dimensional profiles. Macromolecular Crystallography, Pt B 1997, 277, 396-404.
23. Melo, F.; Devos, D.; Depiereux, E.; Feytmans, E., ANOLEA: a www server to assess protein structures. Proc Int Conf Intell Syst Mol Biol 1997, 5, 187-90.
24. Chen, V. B.; Arendall, W. B.; Headd, J. J.; Keedy, D. A.; Immormino, R. M.; Kapral, G. J.; Murray, L. W.; Richardson, J. S.; Richardson, D. C., MolProbity: all-atom structure validation for macromolecular crystallography. Acta Crystallographica Section D-Biological Crystallography 2010, 66, 12-21.
25. Laskowski, R. A.; Macarthur, M. W.; Moss, D. S.; Thornton, J. M., PROCHECK - a program to check the stereochemical quality of protein structures. Journal of Applied Crystallography 1993, 26, 283-291.
26. Nabuurs, S. B.; Wagener, M.; de Vlieg, J., A Flexible Approach to Induced Fit Docking. J. Med. Chem. 2007, 50 (26), 6507-6518.
27. Kontoyianni, M.; McClellan, L. M.; Sokol, G. S., Evaluation of docking performance: Comparative data on docking algorithms. J. Med. Chem. 2004, 47 (3), 558-565.
28. Ewing, T. J. A.; Makino, S.; Skillman, A. G.; Kuntz, I. D., DOCK 4.0: Search strategies for automated molecular docking of flexible molecule databases. J. Comput.-Aided Mol. Des. 2001, 15 (5), 411-428.


29. Kramer, B.; Rarey, M.; Lengauer, T., Evaluation of the FLEXX incremental construction algorithm for protein-ligand docking. Proteins-Structure Function and Genetics 1999, 37 (2), 228-241.
30. Verdonk, M. L.; Cole, J. C.; Hartshorn, M. J.; Murray, C. W.; Taylor, R. D., Improved protein-ligand docking using GOLD. Proteins-Structure Function and Genetics 2003, 52 (4), 609-623.
31. Goodsell, D. S.; Morris, G. M.; Olson, A. J., Automated docking of flexible ligands: Applications of AutoDock. Journal of Molecular Recognition 1996, 9 (1), 1-5.
32. Friesner, R. A.; Banks, J. L.; Murphy, R. B.; Halgren, T. A.; Klicic, J. J.; Mainz, D. T.; Repasky, M. P.; Knoll, E. H.; Shelley, M.; Perry, J. K.; Shaw, D. E.; Francis, P.; Shenkin, P. S., Glide: A new approach for rapid, accurate docking and scoring. 1. Method and assessment of docking accuracy. J. Med. Chem. 2004, 47 (7), 1739-1749.
33. (a) Miteva, M. A.; Lee, W. H.; Montes, M. O.; Villoutreix, B. O., Fast structure-based virtual ligand screening combining FRED, DOCK, and Surflex. J. Med. Chem. 2005, 48 (19), 6012-6022; (b) Pencheva, T.; Soumana, O. S.; Pajeva, I.; Miteva, M. A., Post-docking virtual screening of diverse binding pockets: Comparative study using DOCK, AMMOS, X-Score and FRED scoring functions. European Journal of Medicinal Chemistry 2010, 45 (6), 2622-2628.
34. (a) Jain, A. N., Surflex-Dock 2.1: Robust performance from ligand energetic modeling, ring flexibility, and knowledge-based search. J. Comput.-Aided Mol. Des. 2007, 21 (5), 281-306; (b) Holt, P. A.; Chaires, J. B.; Trent, J. O., Molecular docking of intercalators and groove-binders to nucleic acids using Autodock and Surflex. J. Chem Inf. Model. 2008, 48 (8), 1602-1615.
35. Li, X.; Li, Y.; Cheng, T.; Liu, Z.; Wang, R., Evaluation of the Performance of Four Molecular Docking Programs on a Diverse Set of Protein-Ligand Complexes. Journal of Computational Chemistry 2010, 31 (11), 2109-2125.
36. Plewczynski, D.; Lazniewski, M.; Augustyniak, R.; Ginalski, K., Can We Trust Docking Results? Evaluation of Seven Commonly Used Programs on PDBbind Database. Journal of Computational Chemistry 2011, 32 (4), 742-755.
37. Wang, R. X.; Lai, L. H.; Wang, S. M., Further development and validation of empirical scoring functions for structure-based binding affinity prediction. J. Comput.-Aided Mol. Des. 2002, 16 (1), 11-26.


38. Morris, G. M.; Goodsell, D. S.; Halliday, R. S.; Huey, R.; Hart, W. E.; Belew, R. K.; Olson, A. J., Automated docking using a Lamarckian genetic algorithm and an empirical binding free energy function. Journal of Computational Chemistry 1998, 19 (14), 1639-1662.
39. (a) Kini, R. M.; Evans, H. J., Comparison of protein models minimized by the all-atom and united-atom models in the AMBER force-field - correlation of rms deviation with the crystallographic R factor and size. Journal of Biomolecular Structure & Dynamics 1992, 10 (2), 265-279; (b) Popelier, P. L. A.; Aicken, F. M., Atomic properties of amino acids: Computed atom types as a guide for future force-field design. Chemphyschem 2003, 4 (8), 824-829.
40. Brooks, B. R.; Bruccoleri, R. E.; Olafson, B. D.; States, D. J.; Swaminathan, S.; Karplus, M., CHARMM - a program for macromolecular energy, minimization, and dynamics calculations. Journal of Computational Chemistry 1983, 4 (2), 187-217.
41. (a) Jorgensen, W. L.; Blake, J. F.; Buckner, J. K., Free-energy of TIP4P water and the free-energies of hydration of CH4 and Cl- from statistical perturbation-theory. Chemical Physics 1989, 129 (2), 193-200; (b) Jorgensen, W. L.; Thomas, L. L., Perspective on free-energy perturbation calculations for chemical equilibria. Journal of Chemical Theory and Computation 2008, 4 (6), 869-876.
42. Miyata, T.; Ikuta, Y.; Hirata, F., Free energy calculation using molecular dynamics simulation combined with the three dimensional reference interaction site model theory. I. Free energy perturbation and thermodynamic integration along a coupling parameter. Journal of Chemical Physics 2010, 133 (4).
43. Muegge, I., Effect of ligand volume correction on PMF scoring. Journal of Computational Chemistry 2001, 22 (4), 418-425.
44. (a) Gohlke, H.; Hendlich, M.; Klebe, G., Knowledge-based scoring function to predict protein-ligand interactions. Journal of Molecular Biology 2000, 295 (2), 337-356; (b) Nissink, J. W. M.; Verdonk, M. L.; Klebe, G., Simple knowledge-based descriptors to predict protein-ligand interactions. Methodology and validation. J. Comput.-Aided Mol. Des. 2000, 14 (8), 787-803.
45. DeWitte, R. S.; Shakhnovich, E. I., SMoG: de Novo design method based on simple, fast, and accurate free energy estimates. 1. Methodology and supporting evidence. Journal of the American Chemical Society 1996, 118 (47), 11733-11744.


46. Park, H.; Bahn, Y. J.; Jung, S.-K.; Jeong, D. G.; Lee, S.-H.; Yoon, T.-S.; Kim, S. J.; Ryu, S. E., Discovery of novel Cdc25 phosphatase inhibitors with micromolar activity based on the structure-based virtual screening. J. Med. Chem. 2008, 51 (18), 5533-5541.
47. Hellmuth, K.; Grosskopf, S.; Lum, C. T.; Wuertele, M.; Roeder, N.; von Kries, J. P.; Rosario, M.; Rademann, J.; Birchmeier, W., Specific inhibitors of the protein tyrosine phosphatase Shp2 identified by high-throughput docking. Proceedings of the National Academy of Sciences of the United States of America 2008, 105 (20), 7275-7280.
48. Nguyen, T. L.; Gussio, R.; Smith, J. A.; Lannigan, D. A.; Hecht, S. M.; Scudiero, D. A.; Shoemaker, R. H.; Zaharevitz, D. W., Homology model of RSK2 N-terminal kinase domain, structure-based identification of novel RSK2 inhibitors, and preliminary common pharmacophore. Bioorganic & Medicinal Chemistry 2006, 14 (17), 6097-6105.
49. Clem, B.; Telang, S.; Clem, A.; Yalcin, A.; Meier, J.; Simmons, A.; Rasku, M. A.; Arumugam, S.; Dean, W. L.; Eaton, J.; Lane, A.; Trent, J. O.; Chesney, J., Small-molecule inhibition of 6-phosphofructo-2-kinase activity suppresses glycolytic flux and tumor growth. Molecular Cancer Therapeutics 2008, 7 (1), 110-120.
50. Cozza, G.; Gianoncelli, A.; Montopoli, M.; Caparrotta, L.; Venerando, A.; Meggio, F.; Pinna, L. A.; Zagotto, G.; Moro, S., Identification of novel protein kinase CK1 delta (CK1[delta]) inhibitors through structure-based virtual screening. Bioorganic & Medicinal Chemistry Letters 2008, 18 (20), 5672-5675.
51. Kellenberger, E.; Springael, J. Y.; Parmentier, M.; Hachet-Haas, M.; Galzi, J. L.; Rognan, D., Identification of nonpeptide CCR5 receptor agonists by structure-based virtual screening. J. Med. Chem. 2007, 50 (6), 1294-1303.
52. Dooley, A. J.; Shindo, N.; Taggart, B.; Park, J.-G.; Pang, Y.-P., From genome to drug lead: Identification of a small-molecule inhibitor of the SARS virus. Bioorganic & Medicinal Chemistry Letters 2006, 16 (4), 830-833.
53. Hotez, P. J.; Molyneux, D. H.; Fenwick, A.; Kumaresan, J.; Sachs, S. E.; Sachs, J. D.; Savioli, L., Current concepts - Control of neglected tropical diseases. New England Journal of Medicine 2007, 357, 1018-1027.
54. Smith, D. H.; Pepin, J.; Stich, A. H. R., Human African trypanosomiasis: an emerging public health crisis. British Medical Bulletin 1998, 54 (2), 341-355.


55. (a) Checchi, F.; Piola, P.; Ayikoru, H.; Thomas, F.; Legros, D.; Priotto, G., Nifurtimox plus Eflornithine for Late-Stage Sleeping Sickness in Uganda: A Case Series. PLoS Negl. Trop. Dis. 2007, 1 (2), e64; (b) Priotto, G.; Kasparian, S.; Mutombo, W.; Ngouama, D.; Ghorashian, S.; Arnold, U.; Ghabri, S.; Baudin, E.; Buard, V.; Kazadi-Kyanza, S.; Ilunga, M.; Mutangala, W.; Pohlig, G.; Schmid, C.; Karunakara, U.; Torreele, E.; Kande, V., Nifurtimox-eflornithine combination therapy for second-stage African Trypanosoma brucei gambiense trypanosomiasis: a multicentre, randomised, phase III, non-inferiority trial. The Lancet 2009, 374 (9683), 56-64.
56. Pollastri, M. P.; Campbell, R. K., Target repurposing for neglected diseases. Future Med. Chem. 2011, 3 (10), 1307-1315.
57. Frearson, J. A.; Brand, S.; McElroy, S. P.; Cleghorn, L. A. T.; Smid, O.; Stojanovski, L.; Price, H. P.; Guther, M. L. S.; Torrie, L. S.; Robinson, D. A.; Hallyburton, I.; Mpamhanga, C. P.; Brannigan, J. A.; Wilkinson, A. J.; Hodgkinson, M.; Hui, R.; Qiu, W.; Raimi, O. G.; van Aalten, D. M. F.; Brenk, R.; Gilbert, I. H.; Read, K. D.; Fairlamb, A. H.; Ferguson, M. A. J.; Smith, D. F.; Wyatt, P. G., N-myristoyltransferase inhibitors as new leads to treat sleeping sickness. Nature 2010, 464 (7289), 728-U100.
58. Kuzoe, F. A. S., Current situation of African trypanosomiasis. Acta Tropica 1993, 54 (3-4), 153-162.
59. Rose, P. W.; Beran, B.; Bi, C.; Bluhm, W. F.; Dimitropoulos, D.; Goodsell, D. S.; Prlic, A.; Quesada, M.; Quinn, G. B.; Westbrook, J. D.; Young, J.; Yukich, B.; Zardecki, C.; Berman, H. M.; Bourne, P. E., The RCSB Protein Data Bank: redesigned web site and web services. Nucleic Acids Research 2011, 39, D392-D401.
60. (a) Berman, H. M.; Bhat, T. N.; Bourne, P. E.; Feng, Z. K.; Gilliland, G.; Weissig, H.; Westbrook, J., The Protein Data Bank and the challenge of structural genomics. Nature Structural Biology 2000, 7, 957-959; (b) Berman, H. M.; Westbrook, J. D., The impact of structural genomics on the protein data bank. American journal of pharmacogenomics : genomics-related research in drug development and clinical practice 2004, 4 (4), 247-52; (c) Westbrook, J.; Feng, Z. K.; Chen, L.; Yang, H. W.; Berman, H. M., The Protein Data Bank and structural genomics. Nucleic Acids Research 2003, 31 (1), 489-491.
61. Lichtarge, O.; Bourne, H. R.; Cohen, F. E., An evolutionary trace method defines binding surfaces common to protein families. Journal of Molecular Biology 1996, 257 (2), 342-358.


62. Sankararaman, S.; Sjoelander, K., INTREPID-INformation-theoretic TREe traversal for Protein functional site IDentification. Bioinformatics 2008, 24 (21), 2445-2452.
63. (a) Devos, D.; Valencia, A., Practical limits of function prediction. Proteins: Structure, Function and Genetics 2000, 4, 98-107; (b) Wilson, M. A.; St. Amour, C. V.; Collins, J. L.; Ringe, D.; Petsko, G. A., The 1.8 A resolution crystal structure of YDR533Cp from Saccharomyces cerevisiae: A member of the DJ-1/ThiJ/PfpI superfamily. Proc Natl Acad Sci U S A 2004, 101, 1531-1536.
64. Rost, B., Enzyme function less conserved than anticipated. Journal of Molecular Biology 2002, 318 (2), 595-608.
65. Hendlich, M.; Rippmann, F.; Barnickel, G., LIGSITE: Automatic and efficient detection of potential small molecule-binding sites in proteins. Journal of Molecular Graphics & Modelling 1997, 15 (6), 359-+.
66. Laskowski, R. A., SURFNET - A program for visualizing molecular-surfaces, cavities, and intermolecular interactions. Journal of Molecular Graphics 1995, 13 (5), 323-&.
67. Binkowski, T. A.; Naghibzadeh, S.; Liang, J., CASTp: Computed atlas of surface topography of proteins. Nucleic Acids Research 2003, 31 (13), 3352-3355.
68. Capra, J. A.; Laskowski, R. A.; Thornton, J. M.; Singh, M.; Funkhouser, T. A., Predicting Protein Ligand Binding Sites by Combining Evolutionary Sequence Conservation and 3D Structure. PLoS Computational Biology 2009, 5 (12).
69. Laurie, A. T. R.; Jackson, R. M., Q-SiteFinder: an energy-based method for the prediction of protein-ligand binding sites. Bioinformatics 2005, 21 (9), 1908-1916.
70. (a) Murga, L. F.; Ko, J.; Wei, Y.; Ondrechen, M. J., Central moments based statistical analysis for the determination of functional sites in proteins with THEMATICS. 2006; p 215-224; (b) Wei, Y.; Ko, J.; Murga, L. F.; Ondrechen, M. J., Selective prediction of interaction sites in protein structures with THEMATICS. BMC Bioinformatics 2007, 8.
71. (a) Somarowthu, S.; Yang, H.; Hildebrand, D. G. C.; Ondrechen, M. J., High-performance prediction of functional residues in proteins with machine learning and computed input features. Biopolymers 2011, 95 (6), 390-400; (b) Tong, W. X.; Wei, Y.; Murga, L. F.; Ondrechen, M. J.; Williams, R. J., Partial Order Optimum Likelihood (POOL): maximum likelihood prediction of protein active site residues using 3D structure and sequence properties. PLoS Computational Biology 2009, 5 (1).


CHAPTER 2 IDENTIFICATION OF NEW DRUGS FOR THE TREATMENT OF HUMAN AFRICAN TRYPANOSOMIASIS BASED ON COMPARATIVE MODELING OF TbAUK1


2.1 Introduction

Mammalian genomes code for three Aurora kinases: Aurora-A, Aurora-B, and Aurora-C. They share highly conserved catalytic domains but differ radically in their subcellular localizations. 1 The Aurora kinases are essential for cell division, and their overexpression appears to be closely linked to centrosome amplification, tumorigenesis and transformation. 2 Aurora-A has been reported to be oncogenic, and its overexpression has been correlated with loss of mitotic checkpoint control, chromosome instability, aneuploidy and the formation of multiple solid tumors. 2 Aurora-B catalyzes the phosphorylation of histone H3 during mitosis and has been reported to regulate chromosome condensation. 3 These Aurora kinases have drawn attention due to their enormous potential in disease treatment. 3b The trypanosomal homologs of human Aurora kinases, the Trypanosoma brucei Aurora kinases (TbAUKs), have been reported to be promising targets for trypanosomiasis control. 4 Given the set of inhibitors developed to target human Aurora kinases for cancer treatment, TbAUKs provide an opportunity to repurpose the extensive knowledge in human Aurora medicinal chemistry towards the development of TbAUK inhibitors.

In Trypanosoma brucei, three homologs are present, namely, Trypanosoma brucei Aurora kinase-1 (TbAUK1), Aurora kinase-2 (TbAUK2), and Aurora kinase-3 (TbAUK3). Among these proteins, TbAUK1 has been validated as a viable target for trypanosomiasis treatment. 1b RNA interference experiments have demonstrated that TbAUK1, but not TbAUK2 or TbAUK3, is required for mitotic progression. The loss of TbAUK1 inhibits nuclear division, cytokinesis and growth in the cultured infectious bloodstream form (BF) and insect stage procyclic form (PF). 4

TbAUK1 is susceptible to human Aurora kinase inhibitors. Hesperadin 5 has displayed inhibition of TbAUK1 both in in vitro kinase assays and in cell growth experiments on cultured BF. 4 Recently, we observed that danusertib (PHA-739358) inhibits TbAUK1 in the sub-micromolar range (<1 μM). Danusertib inhibits human Aurora A kinase and is in Phase II clinical trials for solid and hematologic malignancies. 6 The same compound also showed inhibitory activity in vitro against Abl, a kinase target for chronic myelogenous leukemia treatment. 7

Computational methods are now actively applied to structure-based drug design for kinases. For example, Guimaraes and coworkers at Pfizer have studied the sequences and structures of 442 kinases and the conformational change in the P-loop in order to rationalize the impact of the conformational changes on drug selectivity. 8 Daniel et al. have developed an algorithm called S-Filter that utilizes sequence and structural information to predict specificity-determining residues and kinase selectivity profiles for drugs. The method has been validated using known specificity determinants from kinases and has further predicted novel specificity determinants for testing. 9

Computational methods have also been assessed and advanced in kinase studies. For example, Perola et al. at Vertex reported a study of three docking protocols, GOLD, GLIDE, and ICM, to assess the reliability of reproducing ligand-protein complex structures, including those of p38 MAP kinase. 10 Verdonk et al. of Astex Therapeutics Ltd. reported a docking study of ligands against non-native conformations of proteins, e.g. the apo structure or a structure from a different protein-ligand complex. A set of proteins including kinases such as CDK2, p38, Chk1 and c-Abl tyrosine kinases11 were investigated with the GOLD docking program. Recently, Tuccinardi et al. used 12 docking protocols, including Glide, FRED, AutoDock 4 and GOLD, to conduct exhaustive docking experiments assessing their reliability on more than 700 high-resolution kinase structures. Based on the results of self-docking and cross-docking of these kinase structures, the authors confirmed that the decreased reliability in the cross-docking experiments was caused by conformational changes of the kinase proteins. They proposed an alternative homology modeling method for kinases based on the ligand similarity in the active sites, instead of the more conventional method based on protein sequence similarity. 12 In sum, these studies have improved the protocols for modeling and docking studies of kinases and increased the reliability of these computational techniques for kinase inhibitor design.

In this chapter, I have applied computational methods to guide compound repurposing towards trypanosomiasis treatment. Stefan Ochiana helped with the compound synthesis and Prof. Larry Ruben’s group at Southern Methodist University performed the biological assays. Since the absence of a crystal structure of TbAUK1 has hindered structure-based inhibitor design for the target protein, a comparative model of TbAUK1 has been established and validated. Docking studies have helped design and prioritize the synthesis of target compounds based on a series of danusertib analogs. Ligand binding residues of TbAUK1, computationally predicted by THEMATICS and POOL, have offered further structure-based insights for the design of inhibitor affinity and selectivity. Newly synthesized inhibitors have shown good potency in the cell proliferation assay and improved selectivity over human cells.


2.2 Methods

2.2.1 Homology modeling of T. brucei Aurora kinases

The sequence of TbAUK1 (Tb11.01.0330) was searched against the Protein Data Bank (PDB) (http://www.rcsb.org/) using PSI-BLAST13. The homology model of the catalytic domain of TbAUK1 (FASTA sequence numbers: 28-219) was built using YASARA14, a top-performing program in the comparative modeling category of the CASP8 competition15. The protein structures in the PDB with high sequence similarity to the target TbAUK1 protein were scored, based on their sequence alignment score and on the crystal structure resolution. The highest-scoring structures were then selected as the templates for homology modeling. A total of six initial homology models were then built based on each combination of template and alignment.

Finally, YASARA performed a high-resolution refinement for each model structure. Each model was then validated and assigned a quality score. An attempt was then made to improve the model with the highest quality score by replacing low-scoring regions in its structure with those of other models with higher scores in the same local region. This hybrid model was then subjected to further refinement and the quality scores of all models were compared once again to identify the highest-scoring model of all. In the case of TbAUK1, the highest-scoring model was a hybrid model based on a human Aurora A crystal structure (PDB ID 2bmc) 16 and a mouse Aurora A crystal structure (PDB ID 3d14)17 as the templates. Similarly, the TbAUK2 and TbAUK3 homology models were built using YASARA. The best-ranked TbAUK2 model was built based on the template of human Aurora A complexed with AMP (PDB ID: 1MQ4). The best-ranked model of TbAUK3 was built based on human Aurora A as the template (PDB ID: 3fdn).


2.2.2 Model refinement and evaluation

The resultant homology model has a danusertib analog (PHA680632) 6 positioned in the active site cleft. To relieve interatomic clashes within the structure, to refine the geometry and orientation of the bound ligand, and to tune the rotamers of the side chains that interact with the ligand, the comparative models of TbAUK1 were refined for 500 ps using explicit solvent molecular dynamics with the YAMBER3 force field in YASARA. The refined model was then evaluated using PROCHECK and MolProbity.

2.2.3 Active site prediction with THEMATICS and POOL

THEMATICS (Theoretical Microscopic Anomalous Titration Curve Shapes) is based on computed theoretical proton occupation functions, similar to experimental titration curves, obtained from a calculated electrostatic potential function. The shapes of these curves may be used to predict the functionally important residues in a protein. THEMATICS makes predictions for seven ionizable residue types (Arg, Asp, Cys, Glu, His, Lys, and Tyr), with only the 3D structure coordinate file of the query protein as input. The model structures were obtained as described above; the 3D coordinates of all other selected proteins were downloaded from the PDB (http://www.rcsb.org)18 unless otherwise noted. Prior to the THEMATICS/POOL calculations, the structures were preprocessed to delete duplicated chains and to add hydrogen atoms and any missing heavy atoms using the YASARA19 suite of molecular modeling programs.

THEMATICS analysis was performed on the protein structures (homology models and crystal structures) following the procedures summarized by Ko et al. and Wei et al., using a Z score cutoff of 0.99 in the statistical analysis and a distance cutoff of 9.0 Å. 20,21 Only clusters with two or more residues were considered predictive.
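The selection step described above (Z score cutoff, 9.0 Å distance cutoff, clusters of two or more residues) can be illustrated with a minimal sketch. It is not the THEMATICS implementation: the input format, the use of one representative coordinate per residue, the strict "greater than" comparison, and the single-linkage clustering are all assumptions made for illustration.

```python
import math

def predictive_clusters(residues, z_cutoff=0.99, dist_cutoff=9.0):
    """Return clusters (lists of residue names) with two or more members.

    residues: list of (name, z_score, (x, y, z)) tuples; this input format
    is hypothetical, with one representative coordinate per residue.
    """
    # Keep only residues whose Z score exceeds the cutoff (assumed strict).
    selected = [r for r in residues if r[1] > z_cutoff]
    unvisited = set(range(len(selected)))
    clusters = []
    while unvisited:
        # Grow one single-linkage cluster starting from an arbitrary seed.
        seed = unvisited.pop()
        stack, members = [seed], [seed]
        while stack:
            i = stack.pop()
            near = [j for j in unvisited
                    if math.dist(selected[i][2], selected[j][2]) <= dist_cutoff]
            for j in near:
                unvisited.remove(j)
                stack.append(j)
                members.append(j)
        clusters.append(sorted(selected[m][0] for m in members))
    # Only clusters with two or more residues are considered predictive.
    return [c for c in clusters if len(c) >= 2]
```

In this sketch an isolated residue that passes the Z score cutoff is still discarded, which mirrors the requirement that only clusters of two or more residues count.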


POOL (Partial Order Optimal Likelihood)22 is a monotonicity-constrained, maximum likelihood machine learning methodology and it makes predictions for all 20 amino acid types. In this study, the input features for POOL include THEMATICS features (μ4 and buffer range)23, phylogenetic tree-based evolutionary information from INTREPID24, and surface geometric features from (structure only) ConCavity25. POOL generates a rank-ordered list of all residues in the protein structure, in order of their computed probability of functional importance. For present purposes, the residues that are predicted to be the most likely to bind a small molecule inhibitor are those in the top 8% of the POOL ranking. A structural alignment of the TbAUK1 model to the available mammalian crystal structures was performed using PDBefold (http://www.ebi.ac.uk/msd-srv/ssm/ssmstart.html)26.
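As a small illustration of the 8% cutoff, the sketch below takes an already rank-ordered residue list and keeps the top fraction. The function name and the choice to round down are assumptions; the source does not state how a fractional residue count is handled.

```python
import math

def top_fraction(ranked_residues, fraction=0.08):
    """Keep the top `fraction` of a list already sorted from most to least
    likely functionally important (rounding down is an assumption)."""
    n_keep = math.floor(len(ranked_residues) * fraction)
    return ranked_residues[:n_keep]
```

For a model of 190 residues this keeps 15 entries, since 8% of 190 is 15.2.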

2.2.4 Docking experiment for TbAUK1 model

The model TbAUK1 structures were prepared using the Maestro 9.1 protein preparation wizard (Schrödinger, LLC, 2010, New York, NY); bond orders were assigned and the orientation of hydroxyl groups, amide groups of Asn and Gln, and the charge state of His residues were optimized to achieve their local energy minima. Moreover, a restrained minimization of the protein structure was performed using the default constraint of 0.3 Å RMSD and the OPLS 2001 force field27. The docking experiments were performed using Glide 3.5 with standard precision (SP) and/or extra precision (XP) mode without further minimization.

The reliability of the docking method was tested by self-docking, i.e., re-docking the ligands into their original bound crystal structures. The RMSD between the non-hydrogen ligand atoms of the lowest-energy docked pose and those of the experimental bound crystal structure was then calculated. This RMSD was used to measure the accuracy of the docking experiments. A docked pose with an RMSD of less than 2 Å relative to the ligand in the experimental structure has been generally considered to be an accurate pose 48-49.
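The RMSD criterion can be written out as a short sketch. It assumes the docked and crystallographic ligands are given as equally ordered lists of heavy-atom coordinates in the same frame (no superposition is applied, as is usual when scoring a docked pose in the receptor frame); the function names are illustrative and not part of any docking package.

```python
import math

def heavy_atom_rmsd(docked, reference):
    """RMSD in Å between two equally ordered lists of (x, y, z) heavy-atom
    coordinates; no fitting or symmetry correction is performed."""
    if not docked or len(docked) != len(reference):
        raise ValueError("coordinate lists must be non-empty and equal in length")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(docked, reference))
    return math.sqrt(squared / len(docked))

def is_accurate_pose(docked, reference, cutoff=2.0):
    """Apply the 2 Å criterion quoted in the text."""
    return heavy_atom_rmsd(docked, reference) < cutoff
```

Symmetry-equivalent atoms (for example in a phenyl ring) are not handled here, so this simple version can overestimate the RMSD for symmetric ligands.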

A compound library based on the danusertib scaffold was generated from 208 arylacetic acids that were available in pre-weighed quantities from a commercial vendor (ASDI, Inc). The compounds' properties were calculated and the non-Lipinski compliant compounds28 were filtered out (66 remaining) to select the compounds with desirable ADMET properties. Furthermore, molecular fingerprints were generated and used to determine the compound similarity. Fifty compounds were selected based on maximal diversity for the docking experiments.
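The filtering and diversity selection can be sketched as follows. Two assumptions are made loudly here: the Lipinski filter is taken to allow no violations, and the diversity step uses a greedy MaxMin picker on Tanimoto distances, which is a common choice but is not stated in the text as the method actually used. Fingerprints are represented as plain Python sets of "on" bit indices, a hypothetical stand-in for whatever fingerprint type was computed.

```python
def passes_lipinski(mol_weight, logp, h_donors, h_acceptors):
    """Rule-of-five check with zero violations allowed (an assumption)."""
    return (mol_weight <= 500 and logp <= 5
            and h_donors <= 5 and h_acceptors <= 10)

def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints given as sets of bits."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 1.0

def maxmin_pick(fingerprints, n_pick):
    """Greedy MaxMin selection: repeatedly add the compound whose nearest
    already-picked neighbour is farthest away (distance = 1 - Tanimoto)."""
    names = sorted(fingerprints)
    if not names or n_pick <= 0:
        return []
    picked = [names[0]]  # arbitrary but deterministic starting compound
    while len(picked) < min(n_pick, len(names)):
        remaining = [n for n in names if n not in picked]
        best = max(
            remaining,
            key=lambda n: min(1.0 - tanimoto(fingerprints[n], fingerprints[p])
                              for p in picked),
        )
        picked.append(best)
    return picked
```

The starting compound influences which set is chosen, so a different seed can give a different, equally diverse, selection.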

The 3D coordinates of the compounds of interest were generated using the ligprep utility in Maestro 9.0 to create the appropriate tautomer/ionization states of the ligand. The docking experiments were conducted with the constraint that at least one H-bond must be formed between the ligand and hinge region of the Aurora kinases.

2.2.5 Prime MM/GBSA free energy calculation

To obtain a better estimate of the binding free energy of a ligand to a protein, the Molecular Mechanics Generalized Born Surface Area (MM/GBSA) energy was calculated with the Prime MM/GBSA utility (Schrödinger, LLC). The free energy of binding is calculated as

$$\Delta G = \Delta E_{MM} + \Delta G_{solv} + \Delta G_{SA} \qquad (1)$$

E_MM stands for the molecular mechanics contribution in vacuo, consisting of the sum of internal energy terms (bonds, angles, and dihedrals), electrostatic energies, and van der Waals (Lennard-Jones) energies. ΔE_MM is the difference in energy between the protein-ligand complex and the sum of the energies of the ligand and ligand-free protein (or apo protein), using the OPLS force field.

G_solv is the contribution of solvation free energies, expressed as the sum of polar (ΔG_solv,elec) and non-polar (ΔG_np) solvation free energies as shown in equation (2):

$$\Delta G_{solv} = \Delta G_{solv,elec} + \Delta G_{np} \qquad (2)$$

ΔG_solv is the difference between the GBSA solvation energy of the complex and the sum of the solvation energies of the ligand and ligand-free protein.

ΔG_SA is the difference between the surface area energy of the complex and the sum of the surface area energies of the ligand and ligand-free protein. Corrections for entropic changes were not applied.

The docked protein-ligand poses were generated with Glide XP. Only the top two poses for each ligand were selected, and the GBSA calculations were carried out using the GBSA continuum model in Prime29 (Schrödinger, LLC, New York, NY, 2008). For each molecule the best-scoring pose was selected for comparison with the experimental EC50 data.
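The bookkeeping in equations (1) and (2), together with the choice of the best pose per ligand, amounts to a few additions. The sketch below is only an arithmetic illustration and does not call Prime; the function names are hypothetical, the numbers in the usage note are invented, and "best scoring" is assumed to mean the most negative binding free energy.

```python
def binding_free_energy(d_e_mm, d_g_solv_elec, d_g_np, d_g_sa):
    """Combine the energy differences (kcal/mol) as in equations (1) and (2)."""
    d_g_solv = d_g_solv_elec + d_g_np      # equation (2)
    return d_e_mm + d_g_solv + d_g_sa      # equation (1)

def best_pose_energy(poses):
    """poses: iterable of (d_e_mm, d_g_solv_elec, d_g_np, d_g_sa) tuples,
    one per docked pose; returns the most negative binding free energy."""
    return min(binding_free_energy(*pose) for pose in poses)
```

For example, with two invented poses `(-60.0, 20.0, -3.0, -2.0)` and `(-50.0, 20.0, -3.0, -2.0)`, the first gives -45.0 kcal/mol and would be the one carried forward.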

2.2.6 Ligand similarity analysis

The compounds' similarity was measured using the ROCS 3.0 software (OpenEye Scientific Software), a shape-similarity method based on the Tanimoto-like overlap of volumes30. The conformation of the ligand used as a reference was directly extracted from the available X-ray crystal structures. The tested compound conformations were generated using OMEGA2 31 with the maximum number of conformations (10,000). The ligand similarity was assessed using the TanimotoCombo score in ROCS, a similarity score ranging from 0 to 2. 32 This score combines the Tanimoto shape score with the color score for the appropriate spatial overlap of groups with similar properties (donor, acceptor, hydrophobe, cation, anion and ring). The default values of ROCS calculations were used for all other parameters.
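To make the 0 to 2 range concrete, the sketch below shows how a Tanimoto coefficient is formed from overlap terms and how two such coefficients add up to a combined score. It does not call ROCS; the overlap values would come from the shape and color overlays, and the function names are illustrative.

```python
def tanimoto_from_overlaps(overlap_ab, self_overlap_a, self_overlap_b):
    """Tanimoto coefficient O_AB / (O_AA + O_BB - O_AB), ranging from 0 to 1."""
    return overlap_ab / (self_overlap_a + self_overlap_b - overlap_ab)

def tanimoto_combo(shape_tanimoto, color_tanimoto):
    """Sum of the shape and color Tanimoto terms, ranging from 0 to 2."""
    return shape_tanimoto + color_tanimoto
```

Two identical molecules in the same pose give 1.0 for each term and therefore a combined score of 2.0.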

2.3 Results and Discussion

2.3.1 The sequence and 3D structure of the TbAUK1 models

TbAUK1 shares 43% sequence identity with human Aurora A and 41% sequence identity with human Aurora B, with 88% sequence coverage by these crystal structures. It has been noted that the catalytic domains of human Aurora A and B share 76% sequence identity and that the two proteins differ by only three amino acids in their ATP-binding pockets. Aurora B is difficult to crystallize due to its physicochemical properties and, as a result, there are no wild-type human Aurora B crystal structures available in the PDB. The availability of the crystal structures and the higher than 30% sequence identity to known structures suggest that it is feasible to build TbAUK1 models based on human Aurora crystal structures using homology modeling.

A kinase protein such as TbAUK1 can be more flexible than a non-kinase protein and can undergo multiple conformational changes with different bound ligands. Hence it is necessary to select a template not only with a high sequence identity to the target sequence, but also with a similar binding pocket12, i.e., a protein structure that is able to bind congeneric ligands of interest. We studied the known mammalian Aurora A crystal structures with danusertib or its analogs bound (PDB IDs: 2j4z 6, 2j50 6 and 2bmc 33). These structures are likely to yield the best homology models of TbAUK1 for the purposes of investigating danusertib analogs.

Indeed, the final, top-ranked homology model is a hybrid model based on two structures, a human Aurora A structure in complex with a 1,4,5,6-tetrahydropyrrolo[3,4-c]pyrazole derivative (PDB ID: 2bmc) and a triple mutant mouse Aurora A (N186G / K240R / M302L) in complex with 1-{5-[2-(thieno[3,2-d]pyrimidin-4-ylamino)-ethyl]-thiazol-2-yl}-3-(3-trifluoromethyl-phenyl)-urea (PDB ID: 3d14). The ligands co-crystallized in the two crystal structures share a high similarity in shape, indicated by a ROCS TanimotoCombo score of 1.2 (a TanimotoCombo score greater than 0.85 is the similarity cutoff). Ligands with similar shapes could induce similar binding pockets in a protein. Therefore, the hybrid homology model generated from the two crystal structures is assumed to retain a binding pocket similar to those of the templates and to be suitable for interacting with 1,4,5,6-tetrahydropyrrolo[3,4-c]pyrazole analogs, as suggested by the in vitro experiments. In the biochemical in vitro experiments, danusertib was shown to inhibit TbAUK1 with an IC50 in the low µM range.

2.3.2 Structure of the comparative models

The homology model of TbAUK1, shown in Figure 2-1, represents a conventional kinase 3D structure, including an N-lobe, a C-lobe, and an ATP binding pocket located in the cleft between the N- and C-lobes. The activation loop of TbAUK1 is similar to that of the human Aurora kinases, with a consensus sequence of FGWSxxxxxxxRxTxCGTxDYLPPE. The kinases are presumably activated via phosphorylation of this sequence. The binding pocket of TbAUK1 consists of five subareas: (1) the kinase hinge region; (2) the solvent accessible region; (3) the sugar-binding region; (4) the phosphate-binding region; and (5) a small buried region. These subareas have been observed to be conserved across more than 500 kinases.


Figure 2-1. The structure of the TbAUK1 comparative model


2.3.3 Evaluation of the TbAUK1 model structure

To validate the quality of the TbAUK1 model structure, several bioinformatics tools such as PROCHECK and MolProbity were applied. A Ramachandran plot was generated using PROCHECK34 to provide information on the stereochemical quality of the protein structure. Based on an analysis of 118 experimental structures at 2.0 Å resolution, it has been reported that a good quality model would be expected to have at least 90% of the residues in the favored regions of the Ramachandran diagram and ideally no residues outside the allowed regions.

Figure 2-2 shows the Ramachandran plot of the TbAUK1 model, wherein 93.2% (177/190) of all residues are located in the favored regions and 100% (190/190) of the residues are located in the allowed regions, suggesting the model has good stereochemical quality. Thus the model structure meets the PROCHECK standards for a good quality model.

MolProbity provides a detailed inspection of the atom contact information and protein geometry, by reporting residues with poor geometric properties35. The results of the MolProbity evaluation show only four residues with problems: Leu149 and Leu168 have bad rotamers, Lys183 has a bad Cβ deviation, and Pro194 has bad angles. An inspection of the 3D structure of the model revealed that Leu149 is not located in the active site of TbAUK1 and thus is unlikely to be important. THEMATICS and POOL were applied to predict the functionally important residues in the active site of TbAUK1 and these important active site residues are shown in Table 2-1. None of the four residues reported by MolProbity to have unfavorable geometric properties were predicted to be functionally important residues, nor were they contact residues for the compounds of interest. Therefore, based on these assessments, this model is deemed to be of sufficiently good quality to inform ligand-binding studies; this assumption is tested below.


Table 2-1. The predicted important residues in the TbAUK1 model using THEMATICS and POOL (with 8% cutoff)

THEMATICS: GLU 77, ASP 170, HIE 150, ASP 152, ARG 151, ARG 181, TYR 202, CYS 109, LYS 167

POOL: GLU 77, ASP 77, LYS 58, ARG 151, HIS 150, ASP 152, TRP 173, TYR 202, ARG 181, CYS 109, ALA 169, TYR 40, PHE 171, GLN 172, ILE 70

Figure 2-2. The Ramachandran plot of the TbAUK1 model


Figure 2-3. (A) An illustration of 1,4,5,6-tetrahydropyrrolo[3,4-c]pyrazole derivatives (danusertib analogs) and their interactions with the Aurora kinase active site subareas. Hydrogen bonds are shown by dashed lines. The carbon with tetrahedral geometry is indicated by an arrow. (B) A list of first-generation danusertib analogs assayed in the experiments.


2.3.4 The docking study of homology models

First, the reproducibility and reliability of the docking protocol were evaluated by docking a set of eight literature-reported danusertib analogs with reported IC50 values (in human Aurora A) 6 into human Aurora A (PDB ID: 2j50). The docking study successfully reproduced the binding poses of these compounds in human Aurora A as reported in the crystal structures. The root mean squared deviation (RMSD) of a docked ligand relative to the crystallographic pose is less than 1 Å, which is a good indicator and confirms the reproducibility and reliability of the docking protocol.

Two benchmark compounds (danusertib and PHA680632) in addition to three designed danusertib analogs (NEU220, NEU221 and NEU238, see Figure 2-3B) were tested in an in vitro trypanosome growth assay. These ligands were docked into TbAUK1 using the validated docking protocol.

Multiple conformations of kinases have been observed, depending on the type of compound bound36. The compounds of interest in Figure 2-3B are all danusertib analogs. Thus, it is reasonable to consider that the TbAUK1 model adopts a DFG-in conformation36, as the key interactions between danusertib analogs and TbAUK1 are the H-bonds in the hinge region and no extra H-bonds are formed between the ligand and the DFG loop of the TbAUK1 kinase.

The predicted poses of danusertib analogs interacting with the TbAUK1 model, as determined by docking studies, retain the key interactions observed in the crystal structure (PDB ID: 2j4z) of the danusertib/Aurora A complex. As illustrated in Figure 2-3A, H-bonds are formed between the hinge region of TbAUK1 and the ligand, through the interactions of the backbone carbonyl group of Glu107 and the backbone amide group of Cys109 with the nitrogen atoms in the 3-aminopyrazole moiety (NH2-C-N-NH) within the 1,4,5,6-tetrahydropyrrolo[3,4-c]pyrazole. The pyrazinylphenyl tail is oriented towards the solvent accessible region and the R group is positioned towards the phosphate binding pocket. A hydrogen atom (circled in the green region and shown near the top of Figure 2-3A) points toward the small buried region, which cannot accommodate a bulky group. The reasonable docking poses obtained for the danusertib analogs further suggest that TbAUK1 is able to interact favorably with the known danusertib compounds; this is consistent with the experimental observations. To further assess the predictive capability of the homology model, we compared the scores from Glide docking with the EC50 data from the cell proliferation assay. The IC50 data from a protein inhibitory assay are directly related to the affinity between a protein and a ligand37. However, the only available IC50 value is that for danusertib (0.2 μM). Cell-based EC50 data have been obtained for the other compounds and these are used in the present study because of the difficulty of purifying the TbAUK1 protein.

Figure 2-4 shows a scatter plot of the Glide docking scores as a function of the EC50 for the compounds shown in Table 2-2, excluding NEU-238, which has poor solubility. It shows that the docking score is linearly correlated with the EC50, with R² = 0.9. A lower (more negative) Glide docking score generally corresponds to a lower EC50 in cell proliferation assays; for instance, the compound with the lowest Glide docking score (-10.34 kcal/mol) corresponds to the best cell inhibition measurement, with an EC50 value of 550 nM. Conversely, a higher Glide score corresponds to a higher EC50 value.
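A coefficient of determination for a simple linear fit of this kind can be computed as in the sketch below. The function is generic; the example inputs are the EC50 values and Glide scores listed in Table 2-2 with NEU238 excluded, and the result is not guaranteed to reproduce the value reported for Figure 2-4, since the exact data points plotted there may differ from the tabulated ones.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation, equal to R² of a least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return (s_xy * s_xy) / (s_xx * s_yy)

# Values taken from Table 2-2 (danusertib, NEU174, NEU221, NEU220).
ec50_um = [0.55, 3.9, 5.7, 6.1]
glide_score = [-10.34, -9.18, -8.04, -7.68]
fit_quality = r_squared(ec50_um, glide_score)
```

With only four points, any such R² is a weak basis for a quantitative model, which is consistent with the cautious wording used in the text.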

It was expected that the binding free energy calculated using Prime MM/GBSA would provide a better estimate of the affinity between the protein and ligand and thus a better correlation with log (IC50) (or in some cases log (Ki)). 38 The Prime MM/GBSA energies provided superior correlation with the reported IC50 values of human Aurora A and danusertib analogs. 6 Considering that the Prime MM/GBSA method is system- and protocol-dependent, we examined whether the binding free energy calculated using Prime MM/GBSA could give a good correlation with the experimental data for the danusertib analogs and human Aurora A. In Figure 2-5A, there is almost no quantitative correlation between the Glide docking scores and the experimentally determined log (EC50) for these analogs. A linear correlation can be observed between the binding free energy calculated using MM/GBSA and the experimentally determined log (EC50), with an R² value of 0.61 (Figure 2-5B). Therefore, the MM/GBSA method gives a better correlation between the binding free energy and the experimental data for danusertib analogs and human Aurora A. This result suggests it may be worthwhile to calculate the binding free energy using Prime MM/GBSA between danusertib and the homology model for TbAUK1, a homologous protein of Aurora A.

In our experiments, Glide docking scores correlate better with the cell-based inhibitory values (EC50), at least for the cases tried thus far. Nevertheless, both the Glide docking score and the free energy calculation can provide an estimate of compound activity, based on their correlations with the cell-based inhibitory values (EC50). Therefore, the model was used for in silico screening of ligands, in order to prioritize additional generations of compounds for synthesis and testing.


Table 2-2. The docking score and binding free energy calculated using Prime MM/GBSA for danusertib and its derivative compounds, corresponding to the EC50 in the T. brucei anti-proliferation assay

Entry  Compound            T. brucei growth inh.  TbAUK1 Glide    Binding free energy calculated
                           EC50 (µM)              docking score   using MM-GBSA (kcal/mol)
1      Danusertib          0.55                   -10.34          -45.79
2      NEU174 (PHA680632)  3.9                    -9.18           -44.6
3      NEU238              11.12                  -8.49           -43.41
4      NEU221              5.7                    -8.04           -33.95
5      NEU220              6.1                    -7.68           -33.69


[Plot: Glide docking score (y-axis) versus EC50 of ligands in T. b. b. in μM (x-axis); R² = 0.90]

Figure 2-4. Docking score for Glide SP plotted against EC50 data from the cell proliferation assay for the compounds in Table 2-2, excluding NEU-238.

Figure 2-5. (A) Glide score vs. pIC50 and (B) predicted ΔG binding energy vs. pIC50 for the compounds in the literature39.


2.3.5 Compound prioritization via docking

We designed a library of danusertib analogs that retained the tetrahedral geometry on the carbon atom adjacent to the carbonyl group (see Figure 2-3B), in order to further explore the head-group region. Fifty danusertib analogs were designed and subjected to docking for virtual screening, and a total of 224 docked poses were obtained. The top 22 compounds were selected, corresponding to the compounds ranked in the top 10% of the poses. These 22 compounds were predicted to bind to TbAUK1 in a manner similar to that of danusertib (see Figure 2-3), retaining the key interactions between the ligands and TbAUK1. The Cys109 and Glu107 residues in the hinge region of the protein interact with the 3-aminopyrazole moiety. The Lys58 side chain can H-bond with carbonyl groups on the ligand and is also capable of forming a cation-π interaction with the aromatic head group of the ligand (see Figure 2-6).
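The prioritization step can be sketched as a ranking of docked poses followed by a top-10% cut. The text does not say how several poses of the same compound were handled, so the de-duplication below, like the function name and the input format, is an assumption made for illustration.

```python
def top_compounds(poses, percent=10):
    """poses: list of (compound_name, docking_score) pairs, one per docked
    pose. Returns the distinct compounds found among the best-scoring
    `percent` of poses, in rank order (more negative score = better)."""
    ranked = sorted(poses, key=lambda pose: pose[1])
    n_keep = len(ranked) * percent // 100   # integer arithmetic, rounds down
    selected = []
    for name, _score in ranked[:n_keep]:
        if name not in selected:            # keep each compound only once
            selected.append(name)
    return selected
```

With 224 poses this cut keeps the 22 best-scoring poses; whether these map onto exactly 22 distinct compounds depends on how many compounds contribute more than one pose to the top of the list.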


Figure 2-6. (A) The superimposition of the 22 top-ranked danusertib analogs docked in the TbAUK1 model, suggesting a binding mode similar to that described before. Lys58 is colored magenta; (B) An illustration of the 20 target analogs that have been synthesized and tested against T. brucei and MOLT-4 cell cultures.


2.3.6 The potency and selectivity of the synthesized compounds

From the prioritized synthesis list, 20 compounds (see Figure 2-6B) out of 22 were successfully synthesized and subjected to experimental tests. Initially, we tested these compounds at 1 and 10 μM against T. brucei cell cultures. Those compounds showing better than 60% inhibition at 1 μM were selected for EC50 analysis against T. brucei brucei (Lister 427 90-13 and AnTat1.1A, abbreviated T. b. b.) and the human infective strain T. brucei rhodesiense (YTAT1.1 strain, abbreviated T. b. r.). In addition, these ligands were assessed for toxicity against MOLT-4 cells. 40 These results are summarized in Table 2-3. Seven out of the 20 compounds show EC50 values less than 1 μM. Four more compounds show EC50 values in the 1-2 μM range.

Table 2-3 also shows the docking scores of 11 of the newly synthesized compounds in addition to two original hit compounds (danusertib and PHA680632) docked against the TbAUK1 model structure. The docking scores for the newly synthesized compounds with TbAUK1 are in the range of -9.4 to -10.1 kcal/mol, corresponding to compounds predicted to be active by the docking procedure. The EC50 values for the newly synthesized compounds against T. b. b. and T. b. r. are all in the low μM range. That the newly designed compounds show similar potency is consistent with the docking predictions. Nevertheless, it is important to note that lower EC50 values were not obtained for the initial derivatives of danusertib, i.e. NEU220, NEU221 and NEU238. Using the TbAUK1 model for virtual screening, a set of ligands with potency in the low μM range was achieved rapidly.

As observed in Table 2-3, three of the newly designed danusertib analogs that obtained EC50 values less than 1 μM, i.e. NEU335, NEU336, and NEU328, share a hydrophobic R group. Therefore, the next-generation library ligands were designed with several different hydrophobic R groups and initial screening data suggest that these ligands have good potency, although final results were not ready at the time of this writing.

Importantly, an improved selectivity between parasite cells and MOLT-4 cells40 was observed for some of the newly synthesized compounds (see Table 2-3). The selectivity of these compounds has been quantified by the selectivity index, calculated as the ratio of the EC50 against MOLT-4 (human) cells to the EC50 against T. b. r. cells. The initial hit compound (danusertib) has an EC50 value of 0.6 μM against T. b. b. cells and an EC50 value of 0.15 μM against T. b. r. cells, but also an EC50 value of 0.15 μM against MOLT-4 cells. The selectivity index for danusertib is therefore 1. On the other hand, NEU174 has an EC50 value of 4 μM against T. b. b. cells, an EC50 value of 1.25 μM against T. b. r. cells, and an EC50 value of 0.22 μM against MOLT-4 cells, corresponding to a selectivity index of 0.18 for NEU174. An approximately 23-fold selectivity for killing T. b. r. over MOLT-4 cells has been observed in the case of NEU327. In fact, eight of the analogs in Table 2-3 (Entries # 3-10) have selectivity for parasite cells over human cells, with selectivity indices ranging from 1.2 to 23.4. This shows that selectivity for parasite cells over host cells is achievable.
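The selectivity index used in Table 2-3 is a simple ratio, following the column header there (MOLT-4 EC50 divided by T. b. r. EC50). A minimal sketch, with a hypothetical function name and three rows of input taken from Table 2-3:

```python
def selectivity_index(ec50_molt4_um, ec50_tbr_um):
    """EC50 against MOLT-4 cells divided by EC50 against T. b. r. cells;
    values above 1 indicate selectivity for the parasite."""
    return ec50_molt4_um / ec50_tbr_um

# (MOLT-4 EC50, T. b. r. EC50) in μM, taken from Table 2-3.
table_rows = {
    "Danusertib": (0.15, 0.15),
    "NEU174": (0.22, 1.25),
    "NEU327": (14.25, 0.61),
}
indices = {name: selectivity_index(*values) for name, values in table_rows.items()}
```

Because both values are cell-based EC50s, the index reflects whole-cell selectivity and not necessarily selectivity at the enzyme level.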


Table 2-3. Dose-response experiments on the parallel array of analogs of danusertib

Entry  Compound    R1   Ar                     T. b. b.   T. b. r.   MOLT-4     Selectivity        Glide
                                               EC50 (μM)  EC50 (μM)  EC50 (μM)  (MOLT-4/T. b. r.)  docking score
1      Danusertib  OMe  phenyl                 0.6        0.15       0.15       1.00               -10.34
2      NEU174      NH   2,6-diethylphenyl      4          1.25       0.22       0.18               -9.18
3      NEU327      H    2-naphthyl             nd         0.61       14.25      23.4               -9.4
4      NEU336      H    2,3,6-trifluorophenyl  nd         0.32       2.22       6.9                -9.45
5      NEU338 d    OMe  phenyl                 nd         0.61       4.13       6.8                -9.45
6      NEU328      H    3-Cl-Ph                0.85       0.58       4          6.9                -9.80
7      NEU343      Me   4-methylphenyl         nd         0.86       5.48       6.4                -10.1
8      NEU340      H    3,5-dimethylphenyl     2.56       1.04       4.46       4.3                -9.43
9      NEU334      H    2,5-dimethylphenyl     nd         1.2        2.65       2.2                -9.50
10     NEU341      H    2,3,5-trifluorophenyl  nd         2          2.31       1.2                -9.50
11     NEU339      H    3-(2-methylindolyl)    nd         1.2        1.16       1.0                -10.10
12     NEU335 d    iPr  phenyl                 0.67       0.4        0.25       0.63               -9.70
13     NEU333      H    3,5-difluorophenyl     0.59       0.91       0.63       0.7                -9.70

d indicates compounds tested as racemate


We hypothesized that the differences in the residues in the vicinity of the active site between TbAUK1 and human Aurora A would be an important factor in the quest for compound selectivity for the parasite over the human protein. To test this hypothesis, a structural alignment was performed of the active site residues of human Aurora A and TbAUK1 predicted from the THEMATICS/POOL calculations. In Figure 2-7C, the structurally aligned residues show significant differences in the hinge region; Phe108, Cys109, Ser110 and Asn111 in TbAUK1 correspond to Tyr212, Ala213, Phe214 and Leu215 in human Aurora A. However, as the side chains of the amino acids in the hinge region tend to point to the outside of the binding pocket, H-bond networks are formed between the ligand and the backbone of these residues, rather than with the side chains. This suggests that the difference in the identities of these amino acids between human and parasite is unlikely to confer compound specificity.

Except for the hinge region residues, the majority of the residues in the binding pocket of

TbAUK1 and human Aurora A are similar. It is especially noteworthy that the sizes of the two hydrophobic pockets that directly contact the R groups are similar. One hydrophobic pocket on the N-lobe is composed of Gly36, Gly38 and Val43 in the TbAUK1, corresponding to Gly140,

Gly142 and Val 147 in the human Aurora A. However, the other hydrophobic pocket on the C- lobe is composed of Met113, IIe159 and Ala169 in TbAUK1, corresponding to Thr217, IIe263 and Ala273 in human Aurora A. Indeed, the change of Thr217 in human Aurora A to Met113 in

TbAUK1 is a possible source of selectivity of compounds between them, because the replacement of the hydrophilic Thr residue to a sterically more bulky and hydrophobic Met can cause electrostatic and steric differences in the pocket. We hypothesize that Met113 in TbAUK1 provides steric hindrance for aromatic groups and forces these groups to flip towards the N-lobe.

As shown in Figure 2-7A, in the human Aurora A structure (PDB ID: 2j50), the R group (highlighted in the pink circled region in Figures 2-3A, 2-7A, and 2-7B) in the phosphate-binding region flips down towards the C-lobe. In the TbAUK1 model (Figure 2-7B), Met113 sterically hinders the R group and thus forces it to flip upwards towards the N-lobe. For the most selective compound, NEU-327, the naphthyl head group provides lipophilic interactions with the top of the binding pocket, where it also encounters Lys58 of TbAUK1 in a geometry that could allow a favorable π-cation interaction.41

In contrast, an adverse steric interaction of the naphthalene group of 8 is apparent upon docking to human Aurora A (Figure 2-3B), a result that is consistent with its significantly less favorable Glide docking score (-6.469) compared to that of 1 (-10.29). The selectivity rendered by this position is reminiscent of the difference at the Thr217 position in human Aurora A, where Thr is replaced by Glu in human Aurora B. This substitution results in compound selectivity between human Aurora A and B for the clinical candidate MLN8054.42 It is suspected that the planar urea moiety of NEU-174 is less able to accommodate the Thr-to-Met change in the protein, leading to a decrease in potency against T. brucei cultures and a decrease in selectivity over MOLT-4 cells. Docking experiments supported this: the headgroup of NEU-174 (selectivity = 0.18) is rigidly held against Met113, which is likely to lead to lower activity against the parasite enzyme.

Another residue that may provide compound selectivity is the gatekeeper residue, corresponding to Met106 in TbAUK1 and Leu210 in human Aurora A. It has been reported that substitution/mutation of the gatekeeper residue can alter selectivity in several kinases.43

Furthermore, the electrostatic potentials in the active sites of TbAUK1 and human Aurora A are quite different, as shown in Figure 2-8. The binding pocket of human Aurora A is dominated by positive potential (observed in the structures with PDB IDs 2bmc, 2j4z, and 2j50), while a neutral sub-pocket in the phosphate-binding region is clearly observed in TbAUK1. This may also explain in part why the analogs with more hydrophobic aromatic groups were favored for binding to TbAUK1, giving rise to compound selectivity between TbAUK1 and human Aurora A. Altogether, the structural and electrostatic differences in the binding pockets of the two proteins may provide important elements for the design of compound selectivity.

Figure 2-7. Comparison of (A) the human Aurora A/danusertib complex (PDB ID: 2j50; danusertib colored in green) and (B) the predicted conformation of danusertib (colored in light green) in the TbAUK1 model. Important ligand contact residues are illustrated as sticks in both views. The R-group substituents positioned at the phosphate-binding pocket are highlighted with pink circles. (C) A superimposition of TbAUK1 and human Aurora A. The important residues in both proteins are shown as sticks. Only the non-conserved residues in the two proteins are labeled (TbAUK1 residues are colored magenta and human Aurora A residues are colored by element).

Table 2-4. The 3D structural alignment of the predicted important active site residues for the TbAUK1 model structure vs. human Aurora A (PDB ID: 2j50). Each row represents a protein structure and each column represents a spatial position in the structural alignment. Boldface letters indicate the residues predicted to be functionally important using the top 8% of the POOL ranking. Aligned residues that differ between TbAUK1 and human Aurora A are marked with an asterisk. 'y' indicates the important ligand contact residues reported in the literature.39

Figure 2-8. Comparison of the computed electrostatic potential of the binding pocket of (A) human Aurora A (PDB ID: 2j50) and (B) the TbAUK1 model. Negative potential is colored red; positive potential is colored blue.

2.3.7 Homology of TbAUKs and Comparison of TbAUK1, TbAUK2 and TbAUK3

To date, our collaborator (Prof. Larry Ruben, Southern Methodist University) has not successfully expressed catalytically active TbAUK1, which prevents our project from using a biochemical kinase assay. However, we hypothesized that TbAUK2 and/or TbAUK3 (which are easily expressed) could potentially be used as surrogates for TbAUK1. We therefore used molecular modeling techniques to compare the structures of TbAUK1, TbAUK2, and TbAUK3 and to ascertain whether the 3D structures of these kinases are sufficiently similar for their use as surrogates to be fruitful.

We carried out a sequence alignment of the proteins using ClustalW2 to examine differences in their primary sequences. The alignment shows that the sequence identity between TbAUK1 and TbAUK2 is 30%, whereas that between TbAUK1 and TbAUK3 is 31%. We then built homology models for TbAUK2 and TbAUK3. The best-ranked TbAUK2 model was built using human Aurora A complexed with AMP (PDB ID: 1MQ4) as the template. Similarly, the best-ranked TbAUK3 model was built using human Aurora A (PDB ID: 3fdn) as the template. The TbAUK2 model presents two extra helices in the activation loop, a feature that is unique and quite different from TbAUK1, TbAUK3 and human Aurora A, as shown in Figure 2-9.
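A pairwise sequence identity of the kind quoted above can be illustrated with a minimal sketch. It assumes identity is counted over alignment columns in which both sequences carry a residue; alignment programs differ in the denominator they use, so this is not necessarily how the reported 30% and 31% values were obtained. The function name and the two short aligned strings are hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over columns where neither sequence has a gap ('-')."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # gap columns are excluded from the denominator (assumption)
        compared += 1
        if a.upper() == b.upper():
            identical += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * identical / compared


# Hypothetical aligned fragments, for illustration only
seq_a = "LGKG-KFGNV"
seq_b = "LGEGRSYG-V"
identity = percent_identity(seq_a, seq_b)
print(f"{identity:.1f}% identity")
```

Here eight columns are compared and five are identical, so the sketch reports 62.5%.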

We further studied the active site residues of TbAUK1, TbAUK2, TbAUK3 and human Aurora A using a structural alignment of the active site residues predicted from the top 8% of the POOL ranking. Table 2-5 depicts the structurally aligned, predicted active residues for the TbAUK1, TbAUK2, and TbAUK3 models compared to human Aurora A. The predicted residues are shown in boldface. The residues in the TbAUK1, TbAUK2, and TbAUK3 models that differ from the corresponding aligned residues in human Aurora A are highlighted in purple. A comparison of TbAUK1, TbAUK2, TbAUK3 and human Aurora A (Table 2-5) reveals that, at 30 of the 62 aligned important spatial positions, the TbAUK amino acids differ from the corresponding aligned residues in human Aurora A. These differences suggest that TbAUK2 and TbAUK3 are not comparable to TbAUK1, because the important ligand binding residues differ among the three proteins. The columns with these differences in amino acids across the three proteins are highlighted in grey in Table 2-5.
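The column-by-column comparison described above can be sketched as follows. The residue strings are transcribed from the first eleven aligned columns of Table 2-5 only, reading the lowercase "l" entries as Leu (an assumption), so the count below covers that excerpt and not the full 62 positions. The function name is hypothetical.

```python
# One-letter codes for the first eleven aligned columns of Table 2-5.
# Lowercase "l" in the table is read as Leu here (assumption).
reference = "LGKGKFGVKVL"  # human Aurora A, L139 ... L164
models = {
    "TbAUK1": "LGGGNYGVKRL",
    "TbAUK2": "LDEGRFGVKCI",
    "TbAUK3": "LGEGSYSVKEL",
}


def differing_columns(reference: str, models: dict) -> list:
    """Return 0-based column indices where any model residue differs from the reference."""
    for name, seq in models.items():
        if len(seq) != len(reference):
            raise ValueError(f"{name} is not aligned to the reference length")
    return [
        i
        for i, ref_residue in enumerate(reference)
        if any(seq[i] != ref_residue for seq in models.values())
    ]


columns = differing_columns(reference, models)
print(f"{len(columns)} of {len(reference)} columns differ: {columns}")
```

For this excerpt, seven of the eleven columns contain at least one TbAUK residue that differs from human Aurora A.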

The mutation/substitution of these important residues has been correlated experimentally with multiple drug resistance in human Aurora.42,43b Therefore, differences in these critical residues may lead to significant differences in drug design. Mutation of the “gatekeeper” residue in Aurora has been commonly reported to occur in mammalian cells and to lead to multiple drug resistance.43 It has been reported that the T315I gatekeeper mutation in ABL kinases confers resistance against small molecules by increasing or restoring ABL kinase activity, accompanied by aberrant transphosphorylation of endogenous BCR.7 The gatekeeper residue is Leu in human Aurora A (position 210) and in TbAUK2 and TbAUK3 (positions 158 and 161, respectively, in Table 2-5), whereas Met (position 106) is the gatekeeper residue in TbAUK1.

In addition to the gatekeeper residue, multiple structurally aligned residues differ among TbAUK1, TbAUK2 and TbAUK3 (see Table 2-5 and Figure 2-10). For example, Cys109 in TbAUK1 is structurally aligned to Ala161 in TbAUK2 and Cys164 in TbAUK3. Another interesting position is Met113 in TbAUK1, corresponding to Thr in TbAUK3, Ser in TbAUK2, and Thr217 in human Aurora A. It has been reported that the replacement of Thr217 in human Aurora A by Glu in Aurora B accounts for the selectivity of MLN8054 for human Aurora A over B.42 The most unusual observation is that, in the activation loop, the DFG motif conserved in kinases is "DFS" in TbAUK3. The catalytic residues of human Aurora A have been reported in the Catalytic Site Atlas (CSA)44 to be Asp256, Lys258, Glu260, and Asn261, which are conserved in TbAUK1. One of these four catalytic residues, Glu260 in human Aurora A, is replaced by Gly in TbAUK3.

In sum, the predicted important active site residues of TbAUK1, TbAUK2 and TbAUK3 differ significantly, and we concluded that TbAUK2 and TbAUK3 are not likely to be useful surrogates for TbAUK1.

Table 2-5. The structurally aligned, predicted functionally important residues in the models of TbAUK1, TbAUK2, and TbAUK3, in human Aurora A (PDB ID: 2bmc), and in mouse Aurora A (PDB ID: 3d14). Red boxes indicate the catalytic residues in human Aurora A as reported in the Catalytic Site Atlas (CSA) (residues in the 256-261 range). Part of the DFG loop (residues 274-276) is also highlighted in a red box, together with the structurally aligned counterparts. The active site residues predicted by POOL (with INTREPID, THEMATICS and ConCavity information and an 8% cutoff) are shown in boldface. The 3D structural alignment was conducted using PDBeFold. The purple highlighted regions show the amino acids that are not conserved across the different proteins. The ligand contact residues in the crystal structures are labeled 'y'.

DFG-in

Human Aurora A  L139 G140 K141 G142 K143 F144 G145 V147 K162 V163 l164
Mouse Aurora A  l152 G153 K154 G155 K156 F157 G158 V160 K175 V176 l177
TbAUK1          l35 G36 G37 G38 N39 Y40 G41 V43 K58 R59 l60
TbAUK2          l85 D86 E87 G88 R89 F90 G91 V93 K108 C109 I110
TbAUK3          l90 G91 E92 G93 S94 Y95 S96 V98 K113 E114 l115
Danusertib contact residues in TbAUK1: y yyy yyy
Reported ligand contact residues in human Aurora A: y y y y y y

Human Aurora A  Q177 l178 E181 L194 R195 T204 V206 Y207 l208 I209 L210
Mouse Aurora A  Q190 l191 E194 l207 R208 T217 V219 Y220 L221 I222 L223
TbAUK1          Q73 l74 E77 l90 R91 T100 I102 Y103 l104 I105 M106
TbAUK2          Q123 l124 E127 l142 R143 V152 I154 V155 l156 V157 l158
TbAUK3          Q128 l129 E132 v145 R146 R155 V157 V158 l159 V160 l161
Danusertib contact residues in TbAUK1: y y
Reported ligand contact residues in human Aurora A: y y

Human Aurora A  E211 A213 G216 T217 l222 Q223 l225 D229 Y236 R251 I253
Mouse Aurora A  E224 A226 G229 T230 l235 Q236 l238 D242 Y249 R264 I266
TbAUK1          E107 C109 G112 M113 l118 N119 V121 A125 Y132 H147 l149
TbAUK2          E159 A161 G164 T165 l170 D171 Y173 P183 I190 D205 I207
TbAUK3          E162 C164 G167 S168 l173 Q174 T176 D182 Y189 G204 V206
Danusertib contact residues in TbAUK1: y y y y
Reported ligand contact residues in human Aurora A: y y y y

Human Aurora A  H254 D256 K258 E260 N261 l262 L263 k271 A273 D274 F275
Mouse Aurora A  H267 D269 K271 E273 N274 l275 L276 k284 A286 D287 F288
TbAUK1          R151 D152 K154 E156 N157 I158 l159 K167 A169 D170 F171
TbAUK2          R209 D210 K212 D214 N215 I216 l217 l224 A226 D227 F228
TbAUK3          H207 D209 K211 G213 N214 I215 L216 R224 A226 D227 F228
Danusertib contact residues in TbAUK1: y y y y y
Reported ligand contact residues in human Aurora A: yyyyy

Human Aurora A  G276 W277 S278 V279 H280 A281 S283 S284 D294 Y295 M305
Mouse Aurora A  G289 W290 S291 V292 H293 A294 S296 C303 D307 Y308 M318
TbAUK1          G172 W173 S174 V175 H176 D177 S185 C186 E190 Y191 A201
TbAUK2          S229 W230 A231 V232 R233 V234 V247 C248 D252 Y253 G263
TbAUK3          G229 W230 S231 K232 G233 l234 V276 C277 D281 Y282 P292
Danusertib contact residues in TbAUK1:
Reported ligand contact residues in human Aurora A:

Human Aurora A  H306 D311 l315 V317 l318 C319 Y320
Mouse Aurora A  H319 D324 l328 V330 l331 C332 Y333
TbAUK1          Y202 D207 l211 I213 F214 C215 Y216
TbAUK2          C264 D269 V273 V275 V276 A277 Y278
TbAUK3          C293 D298 l302 A304 l305 l306 V307
Danusertib contact residues in TbAUK1:
Reported ligand contact residues in human Aurora A:

Figure 2-9. (A) Superimposition of TbAUK1, TbAUK2 and human Aurora A reveals two extra helices in the activation loop of the TbAUK2 model. TbAUK1 is colored blue, TbAUK2 is colored green, and human Aurora A is colored grey. (B) Superimposition of TbAUK1, TbAUK3 and human Aurora A, in which the two extra helices of the TbAUK2 model are absent. TbAUK1 is colored blue, TbAUK3 is colored cyan, and human Aurora A is colored grey.

Figure 2-10. (A) Superimposition of TbAUK1, TbAUK2 and human Aurora A. The non-conserved residues around the ATP binding pocket are enlarged for TbAUK1, TbAUK2 and human Aurora A (TbAUK1 colored blue, TbAUK2 colored green and human Aurora A colored grey). (B) Superimposition of TbAUK1, TbAUK3 and human Aurora A. The non-conserved residues around the ATP binding pocket are enlarged for TbAUK1, TbAUK3 and human Aurora A (TbAUK1 colored magenta, TbAUK3 colored cyan and human Aurora A colored grey).

2.4 Conclusions

In this project, we used molecular modeling techniques to facilitate the target repurposing of a human Aurora kinase inhibitor towards the parasite homolog TbAUK1. A homology model of TbAUK1 was built and validated using multiple bioinformatics tools and docking studies. The model was then used in docking studies that helped prioritize the synthesis of target compounds and provided insights into possible explanations of inhibitor selectivity. Informed by homology modeling and docking, a series of analogs of danusertib was prepared to explore the scope of the chemotype. Many of the newly designed and synthesized compounds have potencies against trypanosome cells that are comparable to the cellular activity of danusertib6 against a variety of cancer cell lines, yet provide up to a 25-fold improvement in cellular selectivity for parasite cells over human cells. We further examined the active site residues of TbAUK1 and human Aurora A and identified local structural differences that may be used to further improve the design of compound selectivity.

2.5 References

1. (a) Li, Z. Y.; Umeyama, T.; Wang, C. C., The Chromosomal Passenger Complex and a Mitotic Kinesin Interact with the Tousled-Like Kinase in Trypanosomes to Regulate Mitosis and Cytokinesis. PLoS One 2008, 3 (11); (b) Li, Z.; Umeyama, T.; Wang, C. C., The Aurora Kinase in Trypanosoma brucei Plays Distinctive Roles in Metaphase-Anaphase Transition and Cytokinetic Initiation. PLoS Pathog 2009, 5 (9), e1000575.

2. Bolanos-Garcia, V. M., Aurora kinases. Int. J. Biochem. Cell Biol. 2005, 37 (8), 1572-1577.

3. (a) Eyers, P. A.; Churchill, M. E. A.; Maller, J. L., The Aurora A and Aurora B protein kinases - A single amino acid difference controls intrinsic activity and activation by TPX2. Cell Cycle 2005, 4 (6), 784-789; (b) Hans, F.; Skoufias, D. A.; Dimitrov, S.; Margolis, R. L., Molecular Distinctions between Aurora A and B: A Single Residue Change Transforms Aurora A into Correctly Localized and Functional Aurora B. Molecular Biology of the Cell 2009, 20 (15), 3491-3502.

4. Jetton, N.; Rothberg, K. G.; Hubbard, J. G.; Wise, J.; Li, Y.; Ball, H. L.; Ruben, L., The cell cycle as a therapeutic target against Trypanosoma brucei: Hesperadin inhibits Aurora kinase-1 and blocks mitotic progression in bloodstream forms. Molecular Microbiology 2009, 72 (2), 442-458.

5. Sessa, F.; Mapelli, M.; Ciferri, C.; Tarricone, C.; Areces, L. B.; Schneider, T. R.; Stukenberg, R. T.; Musacchio, A., Mechanism of Aurora B activation by INCENP and inhibition by Hesperadin. Molecular Cell 2005, 18 (3), 379-391.

6. Fancelli, D.; Moll, J.; Varasi, M.; Bravo, R.; Artico, R.; Berta, D.; Bindi, S.; Cameron, A.; Candiani, I.; Cappella, P.; Carpinelli, P.; Croci, W.; Forte, B.; Giorgini, M. L.; Klapwijk, J.; Marsiglio, A.; Pesenti, E.; Rocchetti, M.; Roletto, F.; Severino, D.; Soncini, C.; Storici, P.; Tonani, R.; Zugnoni, P.; Vianello, P., 1,4,5,6-Tetrahydropyrrolo[3,4-c]pyrazoles: Identification of a potent Aurora kinase inhibitor with a favorable antitumor kinase inhibition profile. J. Med. Chem. 2006, 49 (24), 7247-7251.

7. Modugno, M.; Casale, E.; Soncini, C.; Rosettani, P.; Colombo, R.; Lupi, R.; Rusconi, L.; Fancelli, D.; Carpinelli, P.; Cameron, A. D.; Isacchi, A.; Moll, J., Crystal structure of the T315I Abl mutant in complex with the aurora kinases inhibitor PHA-739358. Cancer Research 2007, 67 (17), 7987-7990.

8. Guimaraes, C. R. W.; Rai, B. K.; Munchhof, M. J.; Liu, S.; Wang, J.; Bhattacharya, S. K.; Buckbinder, L., Understanding the Impact of the P-loop Conformation on Kinase Selectivity. J. Chem. Inf. Model. 2011, 51 (6), 1199-1204.

9. Caffrey, D. R.; Lunney, E. A.; Moshinsky, D. J., Prediction of specificity-determining residues for small-molecule kinase inhibitors. BMC Bioinformatics 2008, 9.

10. Perola, E.; Walters, W. P.; Charifson, P. S., A detailed comparison of current docking and scoring methods on systems of pharmaceutical relevance. Proteins 2004, 56 (2), 235-249.

11. Verdonk, M. L.; Mortenson, P. N.; Hall, R. J.; Hartshorn, M. J.; Murray, C. W., Protein-Ligand Docking against Non-Native Protein Conformers. J. Chem. Inf. Model. 2008, 48 (11), 2214-2225.

12. Tuccinardi, T.; Botta, M.; Giordano, A.; Martinelli, A., Protein Kinases: Docking and Homology Modeling Reliability. J. Chem. Inf. Model. 2010, 50 (8), 1432-1441.

13. Altschul, S. F.; Madden, T. L.; Schaffer, A. A.; Zhang, J. H.; Zhang, Z.; Miller, W.; Lipman, D. J., Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research 1997, 25 (17), 3389-3402.

14. Krieger, E.; Joo, K.; Lee, J.; Raman, S.; Thompson, J.; Tyka, M.; Baker, D.; Karplus, K., Improving physical realism, stereochemistry, and side-chain accuracy in homology modeling: Four approaches that performed well in CASP8. Proteins 2009, 77, 114-122.

15. Kolodny, R.; Petrey, D.; Honig, B., Protein structure comparison: implications for the nature of 'fold space', and structure and function prediction. Curr. Opin. Struct. Biol. 2006, 16 (3), 393-398.

16. Fancelli, D.; Moll, J.; Varasi, M.; Bravo, R.; Artico, R.; Berta, D.; Bindi, S.; Cameron, A.; Candiani, I.; Cappella, P.; Carpinelli, P.; Croci, W.; Forte, B.; Giorgini, M. L.; Klapwijk, J.; Marsiglio, A.; Pesenti, E.; Rocchetti, M.; Roletto, F.; Severino, D.; Soncini, C.; Storici, P.; Tonani, R.; Zugnoni, P.; Vianello, P., 1,4,5,6-Tetrahydropyrrolo[3,4-c]pyrazoles: Identification of a potent Aurora kinase inhibitor with a favorable antitumor kinase inhibition profile. J. Med. Chem. 2006, 49 (24), 7247-7251.

17. Oslob, J. D.; Romanowski, M. J.; Allen, D. A.; Baskaran, S.; Bui, M.; Elling, R. A.; Flanagan, W. M.; Fung, A. D.; Hanan, E. J.; Harris, S.; Heumann, S. A.; Hoch, U.; Jacobs, J. W.; Lam, J.; Lawrence, C. E.; McDowell, R. S.; Nannini, M. A.; Shen, W.; Silverman, J. A.; Sopko, M. M.; Tangonan, B. T.; Teague, J.; Yoburn, J. C.; Yu, C. H.; Zhong, M.; Zimmerman, K. M.; O’Brien, T.; Lew, W., Discovery of a potent and selective Aurora kinase inhibitor. Bioorganic & Medicinal Chemistry Letters 2008, 18 (17), 4880-4884.

18. Rose, P. W.; Beran, B.; Bi, C.; Bluhm, W. F.; Dimitropoulos, D.; Goodsell, D. S.; Prlic, A.; Quesada, M.; Quinn, G. B.; Westbrook, J. D.; Young, J.; Yukich, B.; Zardecki, C.; Berman, H. M.; Bourne, P. E., The RCSB Protein Data Bank: redesigned web site and web services. Nucleic Acids Research 2011, 39, D392-D401.

19. Krieger, E.; Koraimann, G.; Vriend, G., Increasing the precision of comparative models with YASARA NOVA - a self-parameterizing force field. Proteins-Structure Function and Genetics 2002, 47 (3), 393-402.

20. (a) Ko, J.; Murga, L. F.; Wei, Y.; Ondrechen, M. J., Prediction of active sites for protein structures from computed chemical properties. Bioinformatics 21 (suppl 1), i258-i265; (b) Ko, J. J.; Murga, L. F.; Andre, P.; Yang, H. Y.; Ondrechen, M. J.; Williams, R. J.; Agunwamba, A.; Budil, D. E., Statistical criteria for the identification of protein active sites using theoretical microscopic titration curves. Proteins 2005, 59 (2), 183-195.

21. Wei, Y.; Ko, J.; Murga, L. F.; Ondrechen, M. J., Selective prediction of interaction sites in protein structures with THEMATICS. BMC Bioinformatics 2007, 8.

22. (a) Tong, W. X.; Wei, Y.; Murga, L. F.; Ondrechen, M. J.; Williams, R. J., Partial Order Optimum Likelihood (POOL): Maximum Likelihood Prediction of Protein Active Site Residues Using 3D Structure and Sequence Properties. PLoS Computational Biology 2009, 5 (1); (b) Somarowthu, S.; Yang, H.; Hildebrand, D. G. C.; Ondrechen, M. J., High-performance prediction of functional residues in proteins with machine learning and computed input features. Biopolymers 2011, n/a-n/a.

23. Somarowthu, S.; Yang, H. Y.; Hildebrand, D. G. C.; Ondrechen, M. J., High-Performance Prediction of Functional Residues in Proteins with Machine Learning and Computed Input Features. Biopolymers 2011, 95 (6), 390-400.

24. Sankararaman, S.; Sjoelander, K., INTREPID-INformation-theoretic TREe traversal for Protein functional site IDentification. Bioinformatics 2008, 24 (21), 2445-2452.

25. Capra, J. A.; Laskowski, R. A.; Thornton, J. M.; Singh, M.; Funkhouser, T. A., Predicting Protein Ligand Binding Sites by Combining Evolutionary Sequence Conservation and 3D Structure. PLoS Computational Biology 2009, 5 (12).

26. Krissinel, E.; Henrick, K., Secondary-structure matching (SSM), a new tool for fast protein structure alignment in three dimensions. Acta Crystallographica Section D-Biological Crystallography 2004, 60, 2256-2268.

27. Quinonero, D.; Tomas, S.; Frontera, A.; Garau, C.; Ballester, P.; Costa, A.; Deya, P. M., OPLS all-atom force field for squaramides and squaric acid. Chemical Physics Letters 2001, 350 (3-4), 331-338.

28. Lipinski, C. A.; Lombardo, F.; Dominy, B. W.; Feeney, P. J., Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Adv. Drug Deliv. Rev. 1997, 23 (1-3), 3-25.

29. Yu, Z. Y.; Jacobson, M. P.; Friesner, R. A., What role do surfaces play in GB models? A new-generation of surface-generalized Born model based on a novel Gaussian surface for biomolecules. Journal of Computational Chemistry 2006, 27 (1), 72-89.

30. Rush, T. S.; Grant, J. A.; Mosyak, L.; Nicholls, A., A Shape-Based 3-D Scaffold Hopping Method and Its Application to a Bacterial Protein−Protein Interaction. J. Med. Chem. 2005, 48 (5), 1489-1495.

31. (a) OMEGA, OpenEye Scientific Software: Santa Fe, USA, 2001; (b) Bostrom, J.; Greenwood, J. R.; Gottfries, J., Assessing the performance of OMEGA with respect to retrieving bioactive conformations. Journal of Molecular Graphics & Modelling 2003, 21 (5), 449-462.

32. Tawa, G. J.; Baber, J. C.; Humblet, C., Computation of 3D queries for ROCS based virtual screens. J. Comput.-Aided Mol. Des. 2009, 23 (12), 853-868.

33. Fancelli, D.; Berta, D.; Bindi, S.; Cameron, A.; Cappella, P.; Carpinelli, P.; Catana, C.; Forte, B.; Giordano, P.; Giorgini, M. L.; Mantegani, S.; Marsiglio, A.; Meroni, M.; Moll, J.; Pittala, V.; Roletto, F.; Severino, D.; Soncini, C.; Storici, P.; Tonani, R.; Varasi, M.; Vulpetti, A.; Vianello, P., Potent and selective aurora inhibitors identified by the expansion of a novel scaffold for protein kinase inhibition. J. Med. Chem. 2005, 48 (8), 3080-3084.

34. Laskowski, R. A.; Macarthur, M. W.; Moss, D. S.; Thornton, J. M., PROCHECK - a program to check the stereochemical quality of protein structures. Journal of Applied Crystallography 1993, 26, 283-291.

35. Chen, V. B.; Arendall, W. B.; Headd, J. J.; Keedy, D. A.; Immormino, R. M.; Kapral, G. J.; Murray, L. W.; Richardson, J. S.; Richardson, D. C., MolProbity: all-atom structure validation for macromolecular crystallography. Acta Crystallographica Section D-Biological Crystallography 2010, 66, 12-21.

36. Liu, Y.; Gray, N. S., Rational design of inhibitors that bind to inactive kinase conformations. Nature Chemical Biology 2006, 2 (7), 358-364.

37. Wang, W.; Kollman, P. A., Free energy calculations on dimer stability of the HIV protease using molecular dynamics and a continuum solvent model. Journal of Molecular Biology 2000, 303 (4), 567-582.

38. Hou, T.; Wang, J.; Li, Y.; Wang, W., Assessing the Performance of the MM/PBSA and MM/GBSA Methods. 1. The Accuracy of Binding Free Energy Calculations Based on Molecular Dynamics Simulations. J. Chem. Inf. Model. 2010, 51 (1), 69-82.

39. Fancelli, D.; Moll, J.; Varasi, M.; Bravo, R.; Artico, R.; Berta, D.; Bindi, S.; Cameron, A.; Candiani, I.; Cappella, P.; Carpinelli, P.; Croci, W.; Forte, B.; Giorgini, M. L.; Klapwijk, J.; Marsiglio, A.; Pesenti, E.; Rocchetti, M.; Roletto, F.; Severino, D.; Soncini, C.; Storici, P.; Tonani, R.; Zugnoni, P.; Vianello, P., 1,4,5,6-Tetrahydropyrrolo[3,4-c]pyrazoles: Identification of a potent Aurora kinase inhibitor with a favorable antitumor kinase inhibition profile. J. Med. Chem. 2006, 49 (24), 7247-7251.

40. Minowada, J.; Ohnuma, T.; Moore, G. E., Rosette forming human lymphoid cell lines part 1 establishment and evidence for origin of thymus derived lymphocytes. Journal of the National Cancer Institute 1972, 49 (3), 891-895.

41. Ma, J. C.; Dougherty, D. A., The cation-pi interaction. Chemical Reviews 1997, 97 (5), 1303-1324.

42. Dodson, C. A.; Kosmopoulou, M.; Richards, M. W.; Atrash, B.; Bavetsias, V.; Blagg, J.; Bayliss, R., Crystal structure of an Aurora-A mutant that mimics Aurora-B bound to MLN8054: insights into selectivity and drug design (vol 427, pg 19, 2009). Biochemical Journal 2010, 427, 551-551.

43. (a) Yun, C. H.; Mengwasser, K. E.; Toms, A. V.; Woo, M. S.; Greulich, H.; Wong, K. K.; Meyerson, M.; Eck, M. J., The T790M mutation in EGFR kinase causes drug resistance by increasing the affinity for ATP. Proceedings of the National Academy of Sciences of the United States of America 2008, 105 (6), 2070-2075; (b) Kluter, S.; Simard, J. R.; Rode, H. B.; Grutter, C.; Pawar, V.; Raaijmakers, H. C. A.; Barf, T. A.; Rabiller, M.; van Otterlo, W. A. L.; Rauh, D., Characterization of Irreversible Kinase Inhibitors by Directly Detecting Covalent Bond Formation: A Tool for Dissecting Kinase Drug Resistance. Chembiochem 2010, 11 (18), 2557-2566.

44. Porter, C. T.; Bartlett, G. J.; Thornton, J. M., The Catalytic Site Atlas: a resource of catalytic sites and residues identified in enzymes using structural data. Nucleic Acids Research 2004, 32, D129-D133.

45. Friesner, R. A.; Murphy, R. B.; Repasky, M. P.; Frye, L. L.; Greenwood, J. R.; Halgren, T. A.; Sanschagrin, P. C.; Mainz, D. T., Extra Precision Glide: Docking and Scoring Incorporating a Model of Hydrophobic Enclosure for Protein−Ligand Complexes. J. Med. Chem. 2006, 49 (21), 6177-6196.

46. Friesner, R. A.; Banks, J. L.; Murphy, R. B.; Halgren, T. A.; Klicic, J. J.; Mainz, D. T.; Repasky, M. P.; Knoll, E. H.; Shelley, M.; Perry, J. K.; Shaw, D. E.; Francis, P.; Shenkin, P. S., Glide: A New Approach for Rapid, Accurate Docking and Scoring. 1. Method and Assessment of Docking Accuracy. J. Med. Chem. 2004, 47 (7), 1739-1749.

CHAPTER 3 STRUCTURE-BASED T. brucei DRUG DESIGN USING COMPARATIVE MODELING OF THE PHOSPHODIESTERASE TARGETS TbrPDEB1 AND TbrPDEB2

3.1 Introduction

In mammals, eleven PDE families with 60 different splice variants have been identified based on DNA sequence analysis and on their biochemical and pharmacological characteristics.1 These PDE proteins share >25% sequence identity in the catalytic domain, which is roughly 250 amino acids long. They are key regulators that hydrolyze the second messengers cyclic adenosine monophosphate (cAMP) and cyclic guanosine monophosphate (cGMP) into AMP and GMP.1 They control the spatial and temporal shapes of cyclic nucleotide signals as well as the steady-state levels of intracellular cAMP.2 Over the last two decades, human PDEs have become highly attractive targets for drug development, since they regulate cellular levels of cAMP and cGMP and thus affect a large variety of physiological processes, including immune response, visual response, glycogenolysis, apoptosis and growth control.3 Some highly selective inhibitors have been developed, leading to clinical candidates and drugs for chronic obstructive pulmonary disorder (PDE4),4 erectile dysfunction (PDE5),5 and schizophrenia (PDE10).6

In T. brucei, five PDEs have been identified, namely TbrPDEA, TbrPDEB1, TbrPDEB2, TbrPDEC, and TbrPDED.7 The TbrPDEB family is encoded by two tandemly arranged genes, and the two PDEBs have distinct sub-cellular localizations.8 TbrPDEB1 and TbrPDEB2 are not individually essential for parasite survival, but when both are knocked down simultaneously by RNAi, the parasites are incapable of proper cell division and ultimately die.8

Design of inhibitors with potency and selectivity for the different TbrPDEB subtypes is challenging because no structures have been reported for these proteins. Understanding the structures of these TbrPDEB subtypes at the atomic level would greatly facilitate the design of subtype-selective inhibitors with reduced side effects and improved pharmacological profiles. As no 3D structures are available for these targets, comparative models of TbrPDEB1 and TbrPDEB2 have been constructed and validated to provide structural information to guide T. brucei drug design against these targets.

3.2 Method

3.2.1 Comparative modeling

The protein sequences of TbrPDEB1 and TbrPDEB2 (see Supporting Information) were obtained from TriTrypDB (http://tritrypdb.org/) using the gene identification numbers for TbrPDEB1 (Tb09.160.3590) and TbrPDEB2 (Tb09.160.3630). Both sequences were searched against the PDB database (http://www.pdb.org/) using PSI-BLAST.9 Only the catalytic domains of these proteins were used for comparative modeling, which was carried out with the homology modeling feature of the YASARA suite of programs.10 Multiple TbrPDEB1 and TbrPDEB2 models were built using human PDE4 and/or Leishmania major PDEB1 (LmjPDEB1)11 as templates. The best-ranked models for both TbrPDEB1 and TbrPDEB2 were based on LmjPDEB1 as the template. Multiple model evaluation tools were used to confirm the quality of the model structures, including PROCHECK,12 MolProbity,13 and Verify3D.14

3.2.2 The active site prediction and alignment

The active sites of the comparative models were predicted computationally using THEMATICS15 and POOL.16 The POOL calculations were conducted using input features from INTREPID17 (http://phylogenomics.berkeley.edu/INTREPID/), ConCavity, and THEMATICS, unless otherwise mentioned. Three-dimensional structural alignments were performed using T-Coffee Expresso (3D-Coffee)18 and PDBeFold (http://www.ebi.ac.uk/msd-srv/ssm/ssmstart.html)19 in combination with manual alignment.
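In Chapter 2, predicted active site residues were taken from the top 8% of the POOL ranking. The sketch below illustrates how such a percentage cutoff can be applied to a ranked residue list. It is not the POOL implementation: the function name, the rounding-up rule for fractional counts, and the residue labels are all assumptions made for illustration.

```python
def top_percent(ranked_residues: list, percent: int = 8) -> list:
    """Keep the top `percent` of a list already sorted from most to least likely.

    Fractional counts are rounded up (assumption); integer arithmetic avoids
    floating-point surprises at exact boundaries.
    """
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the range 1-100")
    n_keep = (len(ranked_residues) * percent + 99) // 100  # ceiling division
    return ranked_residues[:n_keep]


# Hypothetical ranking of 50 residues, best first
ranked = [f"R{i}" for i in range(1, 51)]
predicted = top_percent(ranked, percent=8)
print(predicted)
```

With 50 ranked residues, an 8% cutoff keeps the four top-ranked entries.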

3.2.3 Docking of models

The TbrPDEB1 model structures were further processed using the Maestro 9.1 protein preparation wizard (Schrödinger, LLC, 2010, New York, NY). A restrained minimization of the protein structure was performed using the default constraint of 0.3 Å RMSD and the OPLS 2001 force field. The 3D coordinates of piclamilast and its analogs were then generated using the LigPrep utility in Maestro 9.0. The docking parameters were first examined by reproducing the crystal structures of the PDE4D/roflumilast complex (PDB ID 1XOQ) and the PDE4B/rolipram complex (PDB ID 1XMY).20 Docking was performed with Glide version 3.5 in standard precision (SP) mode. The docking experiments were conducted with the constraint that at least one H-bond must be formed between the ligand and the conserved Gln in the P clamp. Partial charges of the compounds were estimated as Mulliken charges, assigned using the semi-empirical AM1 method in the YASARA software.
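Reproduction of a crystallographic pose is commonly judged by the heavy-atom RMSD between the docked and crystal ligand coordinates, with a value of about 2 Å or less often taken as a successful redocking. The text does not state which criterion was applied here, so the sketch below is only an illustration of that common measure. It assumes the two poses share the same atom ordering and the same reference frame; the function name and coordinates are hypothetical.

```python
import math


def pose_rmsd(coords_a: list, coords_b: list) -> float:
    """RMSD (in the units of the input) between two equally ordered coordinate lists.

    No superposition is performed: docked and crystal poses are assumed to be
    in the same receptor frame.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared_sum = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        squared_sum += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(squared_sum / len(coords_a))


# Hypothetical three-atom poses, in Å
crystal_pose = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]
docked_pose = [(0.0, 0.0, 1.0), (1.5, 0.0, 1.0), (1.5, 1.5, 1.0)]
rmsd = pose_rmsd(crystal_pose, docked_pose)
print(f"RMSD = {rmsd:.2f} Å")
```

In this toy case every atom is displaced by 1 Å along z, so the RMSD is 1.00 Å.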


3.3 Results and discussion

3.3.1 Comparative modeling of TbrPDEB1 and TbrPDEB2

To compare the similarities and differences between the parasite enzymes TbrPDEB1 and TbrPDEB2, and to support compound binding hypotheses that drive medicinal chemistry optimization, homology models of TbrPDEB1 and TbrPDEB2 were constructed.

The TbrPDEB1 and TbrPDEB2 proteins contain multiple domains, including the GAF-A, GAF-B and catalytic domains.21 After the TbrPDEB1 and TbrPDEB2 sequences were searched using PSI-BLAST against the PDB database, the closest structural template retrieved for both TbrPDEB1 and TbrPDEB2 was a PDE of the related protozoan parasite Leishmania major, LmjPDEB1 (PDB ID: 2r8q). Since for the present purpose we are only interested in the catalytic domains of the proteins, the sequences of the two proteins were truncated and only the sequences of the catalytic domains were kept. The sequence identities between the catalytic domain of LmjPDEB1 (residues 597 to 930) and those of TbrPDEB1 (residues 575 to 918) and TbrPDEB2 (residues 585 to 918) are 66% and 65%, respectively. The high sequence identity between the target proteins and the template structure suggests that comparative modeling is feasible.
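The sequence identities quoted above are derived from pairwise alignments. As a minimal sketch, assuming that identity is defined as the number of identical aligned residues divided by the number of alignment columns in which neither sequence has a gap (alignment programs differ in the denominator they use), the calculation can be illustrated as follows. The function name and the toy sequences are hypothetical and are not taken from the TbrPDEB alignments.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over columns where neither sequence has a gap ('-')."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0   # columns with a residue in both sequences
    identical = 0  # columns where the two residues match
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a == b:
            identical += 1
    if compared == 0:
        raise ValueError("no aligned positions to compare")
    return 100.0 * identical / compared


# Toy alignment (hypothetical sequences, for illustration only).
toy_a = "YHNHV-HDL"
toy_b = "YHNRVAHEL"
print(percent_identity(toy_a, toy_b))
```

In this toy case eight columns are compared and six are identical, so the reported identity is 75%.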

Two template structures were tried (human PDE4 and LmjPDEB1), but the LmjPDEB1 crystal structure11 yielded the better-scoring model structures and has the higher sequence similarity to the targets (~65%). The model structures for the catalytic domains of TbrPDEB1 and TbrPDEB2 are composed of 16 α helices and β strands. Roughly, these 16 α helices may be further divided into three subdomains consisting of helices 1-7, 8-11, and 12-16, respectively. The active sites are located at the interface of the three subdomains. The comparative models generated for both TbrPDEB1 and TbrPDEB2 are dimers with a twofold symmetry axis similar to that of the crystallized LmjPDEB1. For simplicity, only a single chain has been used for analysis in this chapter, unless otherwise mentioned.

3.3.2 Validation of the structures of models

A Ramachandran plot, generated using PROCHECK,12 gives information on the stereochemical quality of a protein structure. Based on an analysis of 118 X-ray crystal structures at 2.0 Å resolution, it has been reported that a good quality model would be expected to have at least 90% of the residues in the favored regions.12 As shown in the Ramachandran plot for the TbrPDEB1 model (Figure 3-1A), 91.7% of the residues are in favored regions; 99.7% of the residues are in allowed regions; and no residues are located in the disallowed regions. These data indicate that the backbone phi and psi angles of the TbrPDEB1 model structure are within acceptable ranges. Similarly, in the Ramachandran plot for the TbrPDEB2 model (Figure 3-1B), 93.8% of the residues are in favored regions. Only one residue, Ser802, is located in the disallowed region. However, this outlier residue is not located in the binding pocket of TbrPDEB2, and the POOL calculation indicates that it is not a functionally important residue.
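The phi and psi values plotted in a Ramachandran diagram are backbone dihedral (torsion) angles, each defined by four consecutive atoms. The following is a minimal sketch of a dihedral calculation in plain Python; it is not the PROCHECK implementation, and the coordinates used are hypothetical test points rather than atoms from the models.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees defined by four points."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)          # unit vector along the central bond
    # Project the outer bonds onto the plane perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))


# Hypothetical test points: a cis arrangement and a perpendicular arrangement.
print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1)))
print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
```

For a protein backbone, phi would use the atoms C(i-1), N(i), CA(i), C(i) and psi would use N(i), CA(i), C(i), N(i+1).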

The models were further examined using Verify3D, which analyzes the 3D model structure against the 1D sequence.14 Each residue in the sequence is assigned a structural class based on its location and environment, e.g., alpha-helix or beta-sheet, polar or non-polar. In this method, residues are scored according to the compatibility of the sequence with its environment, using a set of good quality structures as the reference set. Residues are scored on a scale of -10 to +10, using a sliding 21 amino acid window. All residues in the model structures of TbrPDEB1 and TbrPDEB2 were found to have positive scores, indicating the structures contain no regions of low quality.
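The sliding-window step described above can be sketched as a simple moving average. This is only an illustration of the windowing idea, assuming that per-residue raw compatibility scores are already available and that the window is truncated at the chain termini; the actual Verify3D scoring and edge handling may differ, and the input values below are hypothetical.

```python
def windowed_scores(raw_scores, window=21):
    """Average each residue's raw score over a centered window.

    The window is truncated at the ends of the chain (an assumption made
    here for illustration).
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    smoothed = []
    for i in range(len(raw_scores)):
        lo = max(0, i - half)
        hi = min(len(raw_scores), i + half + 1)
        segment = raw_scores[lo:hi]
        smoothed.append(sum(segment) / len(segment))
    return smoothed


# Hypothetical raw scores with a 3-residue window, to keep the example small.
print(windowed_scores([0.0, 0.0, 3.0, 0.0, 0.0], window=3))
```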


MolProbity analyzes all-atom contacts and other geometric properties to provide overall statistical evaluations of model structure quality and flags local problem areas. In the MolProbity evaluation, only one residue in TbrPDEB1, Pro884, is found to have a "bad angle". This residue is not annotated as an important active site residue using either an 8% or a 10% POOL cutoff. Similarly, in TbrPDEB2, one outlier residue with a "bad angle", Pro861, was observed. One bad-scoring residue is tolerable, particularly since this residue is neither annotated as an important active site residue nor a contact residue for the compounds of interest. In summary, these data indicate that the model structures of TbrPDEB1 and TbrPDEB2 are of good stereochemical quality.

Figure 3-1. Ramachandran plots of the comparative models of (A) TbrPDEB1 and (B) TbrPDEB2.

3.3.3 Comparison of TbrPDEB1 and TbrPDEB2

It is necessary to identify compounds capable of inhibiting both TbrPDEB1 and TbrPDEB2, since RNAi only kills trypanosomes when both enzymes are disrupted. The feasibility of achieving dual inhibition is supported by analysis of the protein sequences. The overall protein sequence identity between TbrPDEB1 and TbrPDEB2 is 75% using a ClustalW2 alignment;22 the sequence identity within the catalytic domains of these two proteins is 88%, indicating high similarity in the primary sequences.


Table 3-1. Structural alignment of the predicted functionally important active site residues of TbrPDEB1, TbrPDEB2 and LmjPDEB1 obtained with (A) an 8% POOL cutoff and (B) a 10% POOL cutoff. The residues annotated as important residues using POOL calculations are shown in boldface. The differences in residue types between TbrPDEB1 and TbrPDEB2 are highlighted in yellow.

A

TbrPDEB1  Y668  H669  N670  H673  V674  H709  D710  L711
TbrPDEB2  Y668  H669  N670  H673  V674  H709  D710  L711
LmjPDEB   Y680  H681  N682  H685  V686  H721  D722  L723

TbrPDEB1  H713  M714  N717  N718  S719  L741  E742  V743
TbrPDEB2  H713  M714  N717  N718  S719  L741  E742  V743
LmjPDEB   H725  M726  N729  N730  S731  L753  E754  V755

TbrPDEB1  H745  C746  T783  D784  M785  A786  K787  D822
TbrPDEB2  H745  C746  T783  D784  M785  A786  R787  D823
LmjPDEB   H757  C758  T795  D796  M797  A798  K799  D835

TbrPDEB1  I823  S824  V840  T841  E842  E843  F844  Y845
TbrPDEB2  I824  S825  V841  T842  E843  E844  F845  Y846
LmjPDEB   V836  S837  V853  T854  E855  E856  F857  Y858

TbrPDEB1  Q847  M861  F862  D863  G873  Q874  I875  F877
TbrPDEB2  Q848  M862  F863  D864  G874  Q875  I876  F878
LmjPDEB   Q860  M874  F875  D876  G886  Q887  I888  F890

TbrPDEB1  I878  V881  A882
TbrPDEB2  I879  V882  A883
LmjPDEB   I891  V894  A895

B

TbrPDEB1  C600  P603  D613  V627  Y668  H669  N670  H673  V674
TbrPDEB2  I600  F602  D613  V627  Y668  H669  N670  H673  V674
LmjPDEB   I611  F613  N625  A639  Y680  H681  N682  H685  V686

TbrPDEB1  V677  H709  D710  L711  H713  M714  L716  N717  N718
TbrPDEB2  V677  H709  D710  L711  H713  M714  L716  N717  N718
LmjPDEB   V689  H721  D722  L723  H725  M726  V728  N729  N730

TbrPDEB1  S719  F720  L741  E742  V743  H745  C746  T783  D784
TbrPDEB2  S719  F720  L741  E742  V743  H745  C746  T783  D784
LmjPDEB   S731  F732  L753  E754  V755  H757  C758  T795  D796

TbrPDEB1  M785  A786  K787  D822  I823  S824  N825  V826  W836
TbrPDEB2  M785  A786  R787  D823  I824  S825  N826  V827  W837
LmjPDEB   M797  A798  K799  D835  V836  S837  N838  V839  W849

TbrPDEB1  A837  M838  V840  T841  E842  E843  F844  Y845  Q847
TbrPDEB2  A838  M839  V841  T842  E843  E844  F845  Y846  Q848
LmjPDEB   A850  M851  V853  T854  E855  E856  F857  Y858  Q860

TbrPDEB1  G848  M861  F862  D863  G873  Q874  I875  F877  I878
TbrPDEB2  G849  M862  F863  D864  G874  Q875  I876  F878  I879
LmjPDEB   G861  M874  F875  D876  G886  Q887  I888  F890  I891

TbrPDEB1  V881  A882
TbrPDEB2  V882  A883
LmjPDEB   V894  A895

The important active site residues of the two proteins were studied further using local active site alignment. A 3D structural local active site alignment table was generated based on active sites predicted from the top 8% of the POOL rankings (Table 3-1A) and from the top 10% of the POOL rankings (Table 3-1B). The tables are constructed in such a way that the rows correspond to protein structures and the columns correspond to spatial positions occupied by important active site residues predicted by POOL. Structurally aligned positions are excluded from the table if they do not include any residues predicted to be functionally important.

The active site residues predicted by the POOL calculation are shown in boldface. In the alignment of the top 8% of the POOL ranked residues shown in Table 3-1A, the important residues of TbrPDEB1 and TbrPDEB2 are almost identical, except that K787 in TbrPDEB1 is replaced by R787 in TbrPDEB2. Based on the alignment of the top 10% of the POOL ranked residues shown in Table 3-1B, three residues differ between TbrPDEB1 and TbrPDEB2: K787, C600 and P603 in TbrPDEB1 are replaced by R787, I600 and F602 in TbrPDEB2, respectively. This indicates high similarity in the active sites (binding pockets) of these two proteins. Two of the non-conserved positions, C600 in TbrPDEB1 (I600 in TbrPDEB2) and P603 in TbrPDEB1 (F602 in TbrPDEB2), colored in yellow in Table 3-1B, are far away from the binding pocket, indicating that they are likely to be false positives within the top 10% of the ranked residues. POOL has been reported to be successful in placing the most functionally important residues at the top of the rank order. The top 5% of the POOL ranked residues have been shown to predict 86.7% of the active site residues with a 5% false positive rate in the benchmark set of 170 enzymes, the Catalytic Site Atlas-100 (CSA-100).23 In our experiments, the top 8% of the POOL ranked residues predict all of the known catalytically important residues and the majority of the ligand binding residues reported for the LmjPDEB1 protein.24 Therefore, it is reasonable to assume that the top 8% of the POOL ranked residues provide enough of the functionally important active site residues with a low false positive rate for the TbrPDEB proteins. Thus the 8% cutoff was used for active site residue prediction for present purposes.
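Applying a percentage cutoff to a ranked residue list amounts to keeping the top fraction of the ranking. The sketch below illustrates this selection under the assumption that the number of retained residues is rounded up to the next integer; the rounding rule used in the actual POOL workflow is not stated here, and the residue labels are placeholders.

```python
import math


def top_percent(ranked_residues, percent):
    """Return the top `percent` of a list already sorted from most to least likely."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the interval (0, 100]")
    n_keep = math.ceil(len(ranked_residues) * percent / 100)
    return ranked_residues[:n_keep]


# Placeholder ranking of 50 residues (labels are hypothetical).
ranked = [f"R{i}" for i in range(1, 51)]
top8 = top_percent(ranked, 8)
top10 = top_percent(ranked, 10)
print(top8)
print(top10)
```

Because both selections are prefixes of the same ranking, every residue selected at the 8% cutoff is also selected at the 10% cutoff, which is why Table 3-1B is a superset of Table 3-1A.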

To compare TbrPDEB1 and TbrPDEB2, the two comparative models were superimposed as shown in Figure 3-2. The RMSD between these two models is 0.59 Å, suggesting high similarity of the two model structures. Inspection of the two model structures reveals that almost all of the active site residues overlap (see Figure 3-2 and Table 3-1A). These nearly identical active sites in the homology model structures of the two TbrPDEBs confirm that their pairwise conservation is high and that they possess regions previously observed in other PDEs, including a metal binding pocket (M pocket); a solvent-filled side pocket (S pocket); and a pocket containing the conserved purine-binding glutamine residue and hydrophobic clamp (Q pocket), consisting of two hydrophobic pockets (Q1 and Q2) on either side.20 Besides confirming the high similarity in the active sites between TbrPDEB1 and TbrPDEB2, homology modeling also indicates that both are likely to have an extended binding site cleft, termed the “parasite-” or “P-pocket,” that was first observed in LmjPDEB1.25 This extended pocket provides a tunnel from the binding site leading to the exterior of the protein, consistent with the P-pocket observed in LmjPDEB1. This feature is absent from every human PDE, and may therefore be exploitable for development of selective inhibitors of parasite PDEs.
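The RMSD reported for the superimposed models is the root-mean-square distance between equivalent atoms. A minimal sketch of that final step is given below, assuming the two coordinate sets have already been optimally superimposed (for example by a least-squares fit) and that atoms are listed in matching order; the coordinates are hypothetical and do not come from the models.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally sized, pre-aligned coordinate lists."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Hypothetical coordinates in angstroms: every atom is displaced by 1 Å along z.
model_1 = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
model_2 = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd(model_1, model_2))
```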

The comparison of TbrPDEB1 and TbrPDEB2 further supports the hypothesis that TbrPDEB1 could be used as a surrogate for TbrPDEB2 in the initial stage of biological screening to limit cost and improve efficiency. Based on the high similarity between TbrPDEB1 and TbrPDEB2 in their global structures and active sites, it is feasible to develop ligands that inhibit the two proteins simultaneously. This hypothesis has been further supported by experiments in which several benchmark compounds for human PDE4 were shown to inhibit both TbrPDEB1 and TbrPDEB2.26


Figure 3-2. The superimposition of TbrPDEB1 (cyan) and TbrPDEB2 (magenta). The predicted active sites of TbrPDEB1 and TbrPDEB2 (with 8% POOL cutoff) are shown as sticks and are nearly identical. K787 in TbrPDEB1 and R787 in TbrPDEB2 (yellow shaded residues in Table 3-1) are shown in ball & stick.

Figure 3-3. (A) Piclamilast complexed with human PDE4 in the crystal structure (PDB ID: 1xm4) and (B) predicted pose for piclamilast interacting with the TbrPDEB1 model.


3.3.4 Docking study of TbrPDEB1

To help elucidate binding site features that could drive potency differences between the parasite and human enzymes and to gain a better understanding of compound binding to guide medicinal chemistry optimization, a series of compounds of interest were docked into the TbrPDEB1 model. These included 11 analogs of the human PDE4 inhibitor piclamilast4 shown in Table 3-2, together with three benchmark compounds for human PDE4 illustrated in Figure 3-4. The predicted binding poses for these compounds were examined, and the binding features included: 1) the pyrrolidinone and dichloropyridyl head groups of the compounds (such as those shown in Figure 3-4) are positioned towards the M-pocket of TbrPDEB1; 2) a pivotal bidentate hydrogen bonding interaction forms between the invariant binding site glutamine (Q874 in TbrPDEB1) and the catechol ether moiety of the compound; 3) the dialkoxyphenyl scaffold is sandwiched by the conserved hydrophobic clamp (V840 and F877 in TbrPDEB1) with the alkoxy groups pointing to the Q1 and Q2 pockets, respectively. The binding features for the piclamilast analogs in the TbrPDEB1 model were qualitatively similar to the binding poses observed for catechol ether-containing PDE inhibitors20 of human PDE4 (see Figure 3-3).

It has been reported that the docking score can either be used as a cutoff for classifying active versus inactive compounds or used to establish a QSAR model, as a lower (more negative) docking score normally corresponds to a stronger binding affinity between protein and ligand.27 An investigation of the Glide XP docking results reveals that the active compounds have a docking score less than -11.5, while most of the inactive compounds have a docking score higher than -11.5, as shown in Table 3-2. The exceptions were entries #5 and #8. In entry #8, the compound has an ethoxy substituent at the R3 position. It had a docking score of -12.6, corresponding to an active compound, but was inactive in the experiments. Despite this false positive data point for compound #8, the docking study does provide a general ability to distinguish between active and inactive compounds via the docking score. It is expected that high-throughput screening using docking experiments could significantly enrich the set of likely active compounds.
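The cutoff-based classification described above can be reproduced directly from the Glide XP docking scores listed in Table 3-2. The sketch below applies the -11.5 cutoff to those scores; the function and variable names are illustrative, and only the docking scores and the cutoff are taken from the text.

```python
# Glide XP docking scores for entries 1-12, transcribed from Table 3-2.
docking_scores = {
    1: -11.80, 2: -10.97, 3: -12.54, 4: -12.60,
    5: -12.44, 6: -13.18, 7: -12.17, 8: -12.63,
    9: -9.65, 10: -9.94, 11: -10.90, 12: -6.41,
}

CUTOFF = -11.5  # scores below (more negative than) this value are predicted active


def predict_active(scores, cutoff=CUTOFF):
    """Return the sorted entry numbers whose docking score is below the cutoff."""
    return sorted(entry for entry, score in scores.items() if score < cutoff)


predicted = predict_active(docking_scores)
print(predicted)
```

The entries selected this way correspond to the "Predicted active" column of Table 3-2, including entry #8, which was inactive experimentally.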

3.3.5 The potency of the compounds

Analysis of the structures of the ligand-protein complexes has revealed major contributors that likely affect the binding affinity of these compounds. For example, piclamilast is an active compound with an inhibitory potency (IC50) of 4.7 μM against TbrPDEB1, while rolipram is an inactive compound with an IC50 of more than 100 μM against TbrPDEB1. The dichloropyridyl substituent of piclamilast could maximize interactions, as it can form hydrophobic interactions with M785 and I823 of TbrPDEB1, and the heterocyclic nitrogen atom on the pyridine ring can form an H-bond to a water molecule coordinated to a metal ion. Additionally, the nitrogen and oxygen atoms in the amide linkage between the dichloropyridyl group and the dialkoxyphenyl scaffold also form extra H-bonds with water molecules. The smaller pyrrolidinone substituent in rolipram has fewer hydrophobic interactions with the binding pocket of TbrPDEB1 and has a lower binding affinity than the piclamilast analogs. Most of the compounds in Table 3-2 (entries #1-7 and 10-12) have an identical methoxy substituent at the R3 position; thus the substituent at R2 determines the potency of compounds #1-7. It can be seen that the hydrophobic interaction, and thus the potency of the compounds, increases as the alkyl substituent grows (see Table 3-2). Both the hydrophobic interactions and the H-bonds therefore appear to affect the potency of the compounds.


Table 3-2. Summary of the docking and SAR results for a series of piclamilast analogues.

Entry           R1  R2             R3   R4   R5  TbrPDEB1     TbrPDEB2     T. brucei   Glide XP  MM-GBSA     Predicted
                                                 IC50 (μM)a   IC50 (μM)a   EC50 (μM)   docking   (kcal/mol)  active
                                                                                       score
1 (piclamilast) H   O-cyclopentyl  OMe  H    H   4.7 ± 1.0    11.4 ± 1.1   9.6 ± 0.9   -11.80    -33.29      yes
2               H   OMe            OMe  H    H   >100         >100                     -10.97    -23.47      no
3               H   OEt            OMe  H    H   16.5 ± 3.4   34.0 ± 0.9               -12.54    -31.29      yes
4               H   OPr            OMe  H    H   13.6 ± 4.4   9.4 ± 2.4    13.0 ± 2.4  -12.60    -29.21      yes
5               H   OiPr           OMe  H    H   >30b         >30b         27.6 ± 7.1  -12.44    -26.89      yes
6               H   OBu            OMe  H    H   7.7 ± 3.9    14.3 ± 3.7   17.7 ± 3.1  -13.18    -30.01      yes
7               H   OBn            OMe  H    H   12.5 ± 5.3   11.2 ± 1.1   10.1 ± 1.2  -12.17    -33.89      yes
8               H   OEt            OEt  H    H   >100                                  -12.63    -33.57      yes
9               H   OMe            OAc  H    H   >100                                  -9.65     -25.66      no
10              H   OMe            OMe  OMe  H   >>100        >>100                    -9.94     -19.07      no
11              Cl  OMe            OMe  H    H   >>100                                 -10.90    -24.69      no
12              H   OMe            OMe  H    Cl  >>100                                 -6.41     -18.03      no

Figure 3-4. Structures of the three benchmark PDE compounds rolipram, roflumilast and piclamilast.

It is noted that electrostatic interactions may also affect the binding affinity between the compounds and TbrPDEB1. For instance, piclamilast is an active compound, while roflumilast is an inactive compound. The structural differences between piclamilast and roflumilast are located at the R2 and R3 substituents. The R2 substituent is a cyclopentyl group for piclamilast and a cyclopropyl group for roflumilast. The compounds in entries #1 and 3-7 in Table 3-2, which have different alkyl substituents at R2, all showed efficacy for TbrPDEB1 with IC50 values lower than 20 μM. Thus, the different substituent at the R3 position is the leading factor determining the potency and efficacy of piclamilast and roflumilast. The R3 substituent is a methoxy group for piclamilast and a difluoromethoxy group for roflumilast. The Mulliken charge for the hydrogen atom on the methoxy group in piclamilast is 0.03; for the neighboring carbon atom it is 0.105. On the other hand, the Mulliken charge of the two fluorine atoms of the difluoromethoxy group in roflumilast is -0.151 and for the neighboring carbon atom it is 0.323. Thus roflumilast has a substantial local dipole at this position, whereas piclamilast has little. The key contact residues for the difluoromethoxy or methoxy groups in the Q1 pocket of TbrPDEB1 are W836, N825, S833 and V826. Since the carbonyl group on the side chain of N825 and the backbone carbonyl groups of W836 and S833 point into the Q1 pocket, the Q1 pocket might have a slightly negative charge, and thus the electronegative F atoms of roflumilast may lead to weaker binding in the Q1 pocket. Hence, electrostatic interactions between the compounds and TbrPDEB1 may strongly affect the potency of the compounds.

To further investigate the interaction between the piclamilast analogs and TbrPDEB1, Prime/MM-GBSA was applied to calculate the binding free energy of the ligands with the protein. Prime/MM-GBSA can provide a more accurate estimate of the binding free energy between a ligand and a protein than the scoring functions used in docking. The calculated binding free energy presented a trend similar to that of the Glide docking score (see Table 3-2), though it did not show the expected superiority in compound ranking, due to the high standard deviation in the experimental data. Nevertheless, the Prime/MM-GBSA calculation gave individual energy terms between the ligands and the protein, providing a better understanding of which energy terms have more impact on the binding free energy. It confirmed that piclamilast interacts more favorably with TbrPDEB1, with a ΔG value of -33.29 kcal/mol, compared to roflumilast, with a ΔG value of -28.48 kcal/mol. It also showed that the van der Waals and electrostatic interactions of the two compounds are different and may affect their potency against TbrPDEB1. For example, the Coulombic interaction energy is -0.620 kcal/mol for roflumilast and -5.59 kcal/mol for piclamilast, a difference likely caused in part by the different interaction energies of the difluoromethoxy group in roflumilast and the methoxy group in piclamilast with the Q1 pocket of TbrPDEB1. The difference in the van der Waals interaction energies is smaller: -44.858 kcal/mol for roflumilast and -45.394 kcal/mol for piclamilast. This possibly results from the different interactions of the cyclopropylmethyl group in roflumilast and the cyclopentyloxy group in piclamilast with the Q2 pocket of TbrPDEB1. In sum, these factors result in a difference in the binding free energy, which may explain the differences in potency between the compounds.
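The comparison of the energy terms can be made explicit by subtracting the roflumilast values from the piclamilast values. The sketch below uses only the numbers quoted in the text; it assumes that both van der Waals energies are negative (as expected for favorable contacts), and the dictionary keys are illustrative names rather than Prime output labels.

```python
# Prime/MM-GBSA terms quoted in the text, in kcal/mol.
# Assumption: both van der Waals energies carry a negative sign.
energies = {
    "piclamilast": {"dG_bind": -33.29, "coulomb": -5.59, "vdw": -45.394},
    "roflumilast": {"dG_bind": -28.48, "coulomb": -0.620, "vdw": -44.858},
}


def term_differences(ligand_a, ligand_b):
    """Return energy(ligand_a) - energy(ligand_b) for each term, rounded to 3 decimals."""
    return {
        term: round(energies[ligand_a][term] - energies[ligand_b][term], 3)
        for term in energies[ligand_a]
    }


diff = term_differences("piclamilast", "roflumilast")
for term, value in diff.items():
    print(f"{term}: {value:+.3f} kcal/mol")
```

Negative differences indicate terms that favor piclamilast. Under these assumptions the Coulombic term differs by about 5 kcal/mol, whereas the van der Waals term differs by only about 0.5 kcal/mol.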


3.3.6 The specificity of the compounds

Although TbrPDEB1 and human PDE4 show overall similarity in their active sites, comparison of the homology model with human PDE4-inhibitor crystal structures revealed multiple differences. These differences may explain the lower inhibitory potency we observe against the trypanosome enzymes compared to that reported against human PDE4B and PDE4D. For example, the cyclopentyl rings of piclamilast and rolipram and the cyclopropylmethyl group of roflumilast are buried in the lipophilic Q2 pocket in human PDE4, whereas they are predicted to fill the P-pocket in TbrPDEB1 incompletely. The Q1 pocket, which accepts the methoxy groups of piclamilast and rolipram and the difluoromethoxy group of roflumilast, shows subtle differences in polarity. A total of five residues in the Q pocket are not conserved between human PDE4B and TbrPDEB1 and represent significant differences in shape, polarity, hydrogen-bonding capability, hydrophobicity/hydrophilicity, and/or polarizability: P396, Y403, T407, M411, and S442 in human PDE4B are substituted by V826, S833, A837, T841, and G873, respectively, in TbrPDEB1 (Figure 3-5). These changes are likely to affect the binding properties of the Q-pocket region. Finally, the metal binding pocket is slightly more closed in the TbrPDEB1 model compared to hPDE4B, and the observed amino acid difference (A786 in TbrPDEB1 versus S348 in hPDE4B) may alter the interactions between the metal (via intervening water molecules) and the dichloropyridine nitrogen atom of roflumilast and piclamilast or the pyrrolidinone headgroup of rolipram. Cumulatively, these relatively small changes may account for the observed differences in potencies between these similar compounds.
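Identifying non-conserved positions from a structural alignment reduces to comparing the residue types at each aligned position. The following sketch illustrates this for a subset of the Q-pocket positions, using hPDE4/TbrPDEB1 residue pairs taken from Table 3-3 and the text above; the helper name is hypothetical, and conservation is judged here simply by identity of the one-letter residue code.

```python
# (human PDE4 residue, TbrPDEB1 residue) at structurally aligned Q-pocket positions.
aligned_pairs = [
    ("N395", "N825"), ("P396", "V826"), ("Y403", "S833"), ("W406", "W836"),
    ("T407", "A837"), ("M411", "T841"), ("S442", "G873"), ("Q443", "Q874"),
]


def non_conserved(pairs):
    """Return the pairs whose one-letter residue types differ."""
    return [(human, parasite) for human, parasite in pairs if human[0] != parasite[0]]


differences = non_conserved(aligned_pairs)
for human, parasite in differences:
    print(f"{human} -> {parasite}")
```

For this subset the comparison recovers the five substitutions listed in the text. A stricter analysis might treat conservative substitutions (for example Ile/Leu/Val) separately, which this sketch does not attempt.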

In addition to the amino acid differences in the Q1, Q2, and M pockets, which lead to the binding differences of the catechol analogs between TbrPDEB1 and human PDE4, additional differing residues of importance are revealed in Table 3-3. Table 3-3 was generated and arranged in a similar way to Table 3-1. It describes the structurally aligned active site residues associated with protein biochemical function or ligand binding. In total, 36 active site positions have been aligned between human PDE4 and the TbrPDEBs. Ten of the 36 residue positions have been observed to differ between human PDE4 and the TbrPDEBs. The differences at these positions may provide an opportunity to gain selectivity of inhibitors for the parasite protein over the human protein. On the other hand, these differences in the active site may prove to be substantial enough to prevent proven hPDE4 chemotypes from binding favorably to TbrPDEB. Thus, the question of the efficiency of repurposing hPDE4 chemotypes as anti-trypanosomal agents remains open.


Figure 3-5. The ligand binding pocket of TbrPDEB1 (colored in magenta) structurally aligned with human PDE4 (colored in blue). The backbones are rendered as ribbons. The non-conserved important active site residues between TbrPDEB1 and human PDE4 are shown explicitly as sticks.


3.4 Conclusions

In this project, we used molecular modeling techniques to facilitate the target repurposing of human PDE inhibitors towards TbrPDEB1. Homology models of TbrPDEB1 and TbrPDEB2 were established and validated. The comparison of these two models has revealed that TbrPDEB1 and TbrPDEB2 share high similarity in their primary sequences, model structures and local active sites. Thus, it would seem feasible to develop compounds that can inhibit the two TbrPDEBs simultaneously. The TbrPDEB1 model proved to be a useful tool for structure-based inhibitor design and helped to rationalize the interaction between TbrPDEB1 and catechol ether-containing compounds. The comparison between human PDE4 enzymes and TbrPDEB1 has shown differences in their active site residues and thus has provided an opportunity to develop inhibitors with specificity. In sum, the homology modeling and docking studies for TbrPDEB1 have provided a tool for visualizing the binding site and developing hypotheses about structure-activity relationships.


Table 3-3. The structural alignment of the important active site residues for human PDE4, LmjPDEB1 and TbrPDEBs. Residue differences between TbrPDEBs and human PDE are marked with an asterisk *.

*
TbrPDEB1  Y668  H669  H673  H709  D710  H713  N717  N718  L741  E742  H745
TbrPDEB2  Y668  H669  H673  H709  D710  H713  N717  N718  L741  E742  H745
hPDE4     Y233  H234  H238  H274  D275  H278  S282  N283  L303  E304  H307
LmjPDEB1  Y680  H681  H685  H721  D722  H725  N729  N730  L753  E754  H757

* * *
TbrPDEB1  T783  D784  M785  A786  D822  I823  S824  N825  V826  S833  W836
TbrPDEB2  T783  D784  M785  A786  D823  I824  S825  N826  V827  S834  W837
hPDE4     T345  D346  M347  S348  D392  L393  S394  N395  P396  Y403  W406
LmjPDEB1  T795  D796  M797  A798  D835  V836  S837  N838  V839  S846  W849

* * * * * *
TbrPDEB1  A837  V840  T841  E842  E843  F844  M861  F862  Q874  G873  Q874
TbrPDEB2  A838  V841  T842  E843  E844  F845  M862  F863  Q875  G874  Q875
hPDE4     T407  I410  M411  E413  F414  F415  M432  C433  Q444  S442  Q443
LmjPDEB1  A850  V853  T854  E855  E856  F857  M874  F875  Q887  G886  Q887

*
TbrPDEB1  I875  F877  V881
TbrPDEB2  I876  F878  V882
hPDE4     V447  F447  I451
LmjPDEB1  I888  F890  V894

Legend: M binding pocket; Q switch and P clamp pocket; S filled side pocket


3.5 References

1. Hendrix, M.; Kallus, C., Phosphodiesterase Inhibitors: A Chemogenomic View. In Chemogenomics in Drug Discovery, Wiley-VCH Verlag GmbH & Co. KGaA: 2005; pp 243-288.

2. Manallack, D. T.; Hughes, R. A.; Thompson, P. E., The next generation of phosphodiesterase inhibitors: Structural clues to ligand and substrate selectivity of phosphodiesterases. J. Med. Chem. 2005, 48 (10), 3449-3462.

3. Card, G. L.; England, B. P.; Suzuki, Y.; Fong, D.; Powell, B.; Lee, B.; Luu, C.; Tabrizizad, M.; Gillette, S.; Ibrahim, P. N.; Artis, D. R.; Bollag, G.; Milburn, M. V.; Kim, S. H.; Schlessinger, J.; Zhang, K. Y. J., Structural basis for the activity of drugs that inhibit phosphodiesterases. Structure 2004, 12 (12), 2233-2247.

4. Huang, Z.; Ducharme, Y.; Macdonald, D.; Robichaud, A., The next generation of PDE4 inhibitors. Current Opinion in Chemical Biology 2001, 5 (4), 432-438.

5. Francis, S. H.; Blount, M. A.; Corbin, J. D., Mammalian Cyclic Nucleotide Phosphodiesterases: Molecular Mechanisms and Physiological Functions. Physiological Reviews 2011, 91 (2), 651-690.

6. Yang, S.-W.; Smotryski, J.; McElroy, W. T.; Tan, Z.; Ho, G.; Tulshian, D.; Greenlee, W. J.; Guzzi, M.; Zhang, X.; Mullins, D.; Xiao, L.; Hruza, A.; Chan, T.-M.; Rindgen, D.; Bleickardt, C.; Hodgson, R., Discovery of orally active pyrazoloquinolines as potent PDE10 inhibitors for the management of schizophrenia. Bioorganic & Medicinal Chemistry Letters 2012, 22 (1), 235-239.

7. Kunz, S.; Beavo, J. A.; D'Angelo, M. A.; Flawia, M. M.; Francis, S. H.; Johner, A.; Laxman, S.; Oberholzer, M.; Rascon, A.; Shakur, Y.; Wentzinger, L.; Zoraghi, R.; Seebeck, T., Cyclic nucleotide specific phosphodiesterases of the kinetoplastida: A unified nomenclature. Molecular and Biochemical Parasitology 2006, 145 (1), 133-135.

8. Oberholzer, M.; Marti, G.; Baresic, M.; Kunz, S.; Hemphill, A.; Seebeck, T., The Trypanosoma brucei cAMP phosphodiesterases TbrPDEB1 and TbrPDEB2: flagellar enzymes that are essential for parasite virulence. FASEB J 2007, 21 (3), 720-31.

9. Altschul, S. F.; Madden, T. L.; Schaffer, A. A.; Zhang, J.; Zhang, Z.; Miller, W.; Lipman, D. J., Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25 (17), 3389-402.

10. Krieger, E.; Joo, K.; Lee, J.; Raman, S.; Thompson, J.; Tyka, M.; Baker, D.; Karplus, K., Improving physical realism, stereochemistry, and side-chain accuracy in homology modeling: Four approaches that performed well in CASP8. Proteins 2009, 77 Suppl 9, 114-22.

11. Wang, H.; Yan, Z.; Geng, J.; Kunz, S.; Seebeck, T.; Ke, H., Crystal structure of the Leishmania major phosphodiesterase LmjPDEB1 and insight into the design of the parasite-selective inhibitors. Mol. Microbiol. 2007, 66 (4), 1029-38.

12. Laskowski, R. A.; Macarthur, M. W.; Moss, D. S.; Thornton, J. M., PROCHECK: a program to check the stereochemical quality of protein structures. Journal of Applied Crystallography 1993, 26, 283-291.

13. Chen, V. B.; Arendall, W. B.; Headd, J. J.; Keedy, D. A.; Immormino, R. M.; Kapral, G. J.; Murray, L. W.; Richardson, J. S.; Richardson, D. C., MolProbity: all-atom structure validation for macromolecular crystallography. Acta Crystallographica Section D-Biological Crystallography 2010, 66, 12-21.

14. Eisenberg, D.; Luthy, R.; Bowie, J. U., VERIFY3D: Assessment of protein models with three-dimensional profiles. Macromolecular Crystallography, Pt B 1997, 277, 396-404.

15. Wei, Y.; Ko, J.; Murga, L. F.; Ondrechen, M. J., Selective prediction of interaction sites in protein structures with THEMATICS. BMC Bioinformatics 2007, 8.

16. (a) Tong, W.; Wei, Y.; Murga, L. F.; Ondrechen, M. J.; Williams, R. J., Partial Order Optimum Likelihood (POOL): Maximum Likelihood Prediction of Protein Active Site Residues Using 3D Structure and Sequence Properties. PLoS Computational Biology 2009, 5 (1); (b) Somarowthu, S.; Yang, H.; Hildebrand, D. G. C.; Ondrechen, M. J., High-performance prediction of functional residues in proteins with machine learning and computed input features. Biopolymers 2011, 95 (6), 390-400.

17. Sankararaman, S.; Sjoelander, K., INTREPID-INformation-theoretic TREe traversal for Protein functional site IDentification. Bioinformatics 2008, 24 (21), 2445-2452.

18. Di Tommaso, P.; Moretti, S.; Xenarios, I.; Orobitg, M.; Montanyola, A.; Chang, J.-M.; Taly, J.-F.; Notredame, C., T-Coffee: a web server for the multiple sequence alignment of protein and RNA sequences using structural information and homology extension. Nucleic Acids Research 2011, 39, W13-W17.

19. Krissinel, E.; Henrick, K., Secondary-structure matching (SSM), a new tool for fast protein structure alignment in three dimensions. Acta Crystallographica Section D-Biological Crystallography 2004, 60, 2256-2268.

20. Card, G. L.; England, B. P.; Suzuki, Y.; Fong, D.; Powell, B.; Lee, B.; Luu, C.; Tabrizizad, M.; Gillette, S.; Ibrahim, P. N.; Artis, D. R.; Bollag, G.; Milburn, M. V.; Kim, S.-H.; Schlessinger, J.; Zhang, K. Y. J., Structural Basis for the Activity of Drugs that Inhibit Phosphodiesterases. Structure 2004, 12 (12), 2233-2247.

21. Oberholzer, M.; Marti, G.; Baresic, M.; Kunz, S.; Hemphill, A.; Seebeck, T., The Trypanosoma brucei cAMP phosphodiesterases TbrPDEB1 and TbrPDEB2: flagellar enzymes that are essential for parasite virulence. FASEB Journal 2007, 21 (3), 720-731.

22. Thompson, J. D.; Higgins, D. G.; Gibson, T. J., CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994, 22 (22), 4673-80.

23. Somarowthu, S.; Yang, H. Y.; Hildebrand, D. G. C.; Ondrechen, M. J., High-Performance Prediction of Functional Residues in Proteins with Machine Learning and Computed Input Features. Biopolymers 2011, 95 (6), 390-400.

24. Wang, H.; Yan, Z.; Geng, J.; Kunz, S.; Seebeck, T.; Ke, H., Crystal structure of the Leishmania major phosphodiesterase LmjPDEB1 and insight into the design of the parasite-selective inhibitors. Molecular Microbiology 2007, 66 (4), 1029-1038.

116

25. Wang, H.; Yan, Z.; Geng, J.; Kunz, S.; Seebeck, T.; Ke, H., Crystal structure of the Leishmania major phosphodiesterase LmjPDEB1 and insight into the design of the parasite-selective inhibitors. Mol Microbiol 2007, 66 (4), 1029-38.

26. Bland, N. D.; Wang, C.; Tallman, C.; Gustafson, A. E.; Wang, Z.; Ashton, T. D.; Ochiana, S. O.; McAllister, G.; Cotter, K.; Fang, A. P.; Gechijian, L.; Garceau, N.; Gangurde, R.; Ortenberg, R.; Ondrechen, M. J.; Campbell, R. K.; Pollastri, M. P., Pharmacological Validation of Trypanosoma brucei Phosphodiesterases B1 and B2 as Druggable Targets for African Sleeping Sickness. Journal of Medicinal Chemistry 2011, 54 (23), 8188-8194.

27. Friesner, R. A.; Murphy, R. B.; Repasky, M. P.; Frye, L. L.; Greenwood, J. R.; Halgren, T. A.; Sanschagrin, P. C.; Mainz, D. T., Extra Precision Glide: Docking and Scoring Incorporating a Model of Hydrophobic Enclosure for Protein−Ligand Complexes. J. Med. Chem. 2006, 49 (21), 6177-6196.


CHAPTER 4 STRUCTURALLY ALIGNED LOCAL SITES OF ACTIVITY (SALSAS) FOR FUNCTIONAL CHARACTERIZATION OF ENZYME STRUCTURES IN THE CONCANAVALIN A-LIKE LECTINS/GLUCANASES SUPERFAMILY


4.1 Introduction

The Protein Structure Initiative (PSI) has led to significant growth in the number of protein structures in the Protein Data Bank (PDB) 1. However, most of the new structures emerging from structural genomics efforts are of unknown or uncertain function. There are now over 11,000 structural genomics (SG) proteins that have been deposited, mostly listed as “putative,” “unknown function,” or “hypothetical” proteins 2. One of the most challenging tasks to emerge from the PSI is to discover the biochemical function of proteins based on their three-dimensional structures. High-throughput structure determination is producing new information so rapidly that experimental characterization of all of the protein structures of unknown function is not feasible. Therefore, computational methods are needed for reliable, large-scale annotation of the biochemical functions of protein structures.

The most general computational approach to determining the biochemical function of a protein is to transfer function to the query protein from experimentally validated proteins, based on sequence or structure similarity (particularly overall fold). However, functional assignment from sequence similarity is often misleading. The fundamental assumption of homology-based methods is that evolutionary proximity implies shared or similar function. But even if a pair of proteins shares up to 70% sequence identity, in 10% of cases the two proteins interact with different substrates, and when two proteins share a sequence similarity of 50% or lower, they commonly have different functions 3. Recently, it has been reported that in several public protein databases, the functional misannotation rate averages between 5% and 63%, using members of the six superfamilies from the Structure-Function Linkage Database (SFLD) as the gold standard 4.

Determining the biochemical function of a protein based on its structure may not be reliable either, because proteins sharing a similar fold may have completely different functions. One well-known example is the set of enzymes that have the TIM (triosephosphate isomerase) barrel fold. In the superfamily database 5, more than 33 superfamilies adopt the TIM β/α-barrel fold, and these superfamilies represent dozens of diverse biochemical functions. In the enolase superfamily there are eight subgroups, including enolase and other metabolically specialized enzymes, i.e., mandelate racemase, galactonate dehydratase, glucarate dehydratase, muconate-lactonizing enzyme, N-acylamino acid racemase, β-methylaspartate ammonia-lyase, and o-succinylbenzoate synthase 6. Our previous study of the proteins in the DJ1 superfamily showed that these proteins have very high similarity in the core fold and a conserved cysteine residue in the known or presumed active sites, which could lead to the incorrect conclusion that these proteins are likely to have the same biochemical function. Yet a more detailed study revealed differences in their predicted active sites, suggesting that they belong to multiple functional subgroups 7. On the other hand, proteins with different sequences or structural folds may possess identical functions. This is well illustrated by the glycoside hydrolases, a set of proteins with the same or similar biochemical function spanning several completely different fold types, including the (α/α)6-barrel, the (β/α)8-barrel, and the 5-bladed β-propeller 8.

To investigate the biochemical functions of proteins, a number of computational methods have been developed for the prediction of functionally important residues, i.e., catalytic and ligand-binding residues. For instance, ConSurf 9 utilizes sequence conservation, and Evolutionary Trace 10 uses a phylogenetic tree, to detect the important residues. A newer method, INTREPID, utilizes phylogenetic tree traversal and Jensen-Shannon divergence for enhanced performance in the prediction of functionally important residues 11. Protein surface topography has also been used as an input feature to identify functionally important residues, as in Ligsite 12, Surfnet 13, CASTp 14, and ConCavity 15. Solvent mapping, PocketFinder, and Q-SiteFinder 16 are docking-based methods that use small-molecule probes to identify regions of small-molecule interaction on the protein surface. THEMATICS (Theoretical Microscopic Anomalous Titration Curve Shapes) uses the computed electrostatic potential of the protein structure to obtain theoretical titration curves for each of the ionizable residues (Arg, Asp, Cys, Glu, His, Lys, and Tyr), and identifies functionally important residues by their anomalous titration behavior 17. The first derivative functions of the computed titration curves are essentially probability distribution functions. Deviations of the computed titration curves from the ideal Henderson-Hasselbalch shape are quantified using metrics describing these first derivative functions: the third and fourth central moments, μ3 and μ4, and the theoretical buffer range.

A machine learning method that utilizes THEMATICS data as input features has been developed for high-performance prediction of functionally important residues. This method, Partial Order Optimum Likelihood (POOL), calculates a score for every residue that is proportional to the probability that the residue is in the active site of the protein 18. While THEMATICS predicts only the seven types of residues that can transfer protons, POOL predicts all 20 amino acid types through the use of environment variables, distance-dependent measures of the THEMATICS μ3, μ4, and buffer range values in the neighborhood of each residue. POOL also incorporates optional input features such as surface topography, sequence conservation scores, and phylogenetic information to improve its overall performance for functional annotation. POOL performance has been validated on the CSA-100, a manually curated benchmark database for functional residue annotation based on published experimental data 19.

Starting with the computational active site predictors POOL and THEMATICS developed in the Ondrechen group, I was part of a team that developed a new method for predicting the biochemical functions of proteins, Structurally Aligned Local Sites of Activity (SALSA). The other members of the team who developed the SALSA method are Joslynn Lee, Ramya Parasuram, Pengcheng Yin, and Dr. Srinivas Somarowthu. In this thesis, SALSA has been applied to analyze the concanavalin A-like lectin/glucanase superfamily as a model system.

Here we focus on the enzyme members of the concanavalin A-like lectins/glucanases superfamily, because they play important roles in cellular component organization or biogenesis, cellular metabolic processes, cellular developmental processes, and cellular localization 20. The superfamily is comprised of proteins with a similar core fold: a sandwich of 12-14 anti-parallel beta strands in two curved sheets, with the catalytic site located in the concave cleft (see Figure 4-1). Although the proteins share high similarity in the core fold, their sequence identities are generally low, and they exhibit a variety of biochemical functions, including multiple types of glycoside hydrolases, peptidases, and lyases. Because of this combination of low sequence identity and similar core structure, it is not feasible to annotate the biochemical functions of these proteins by function transfer based on overall sequence or overall structure. There are plenty of data on catalytic mechanisms and on ligand-protein interactions for the enzymes in the superfamily, as well as abundant 3D structures in the majority of the subgroups, facilitating the testing and validation of the new biochemical function classification method 33-37. Therefore, the superfamily has been selected for proof of concept of the SALSA method and for the examination of the relationship between local protein structure and biochemical function. It will be shown that, using SALSA, the superfamily has been successfully sorted into seven functional subgroups: glycoside hydrolase-16, glycoside hydrolase-11, glycoside hydrolase-12, glycoside hydrolase-7, glycoside hydrolase-54, peptidase A4, and alginate lyase.


Consensus signatures have been constructed for each subgroup to serve as templates for functional assignment. The consensus signatures are then used to predict the functions of the five SG proteins in the superfamily. The putative biochemical function of one of the SG proteins has been confirmed. However, the other four SG proteins, all putative glycoside hydrolases, have predicted local sites that do not match any of the consensus signatures of the known glycoside hydrolases in the superfamily. Thus they could represent new functional subgroups within the superfamily.


Figure 4-1. 3D structure of a representative protein in the concanavalin A-like lectins/glucanases superfamily (endoglucanase from Humicola grisea, PDB ID: 1uu4). The core structure of the proteins in the superfamily consists of two major anti-parallel, curved β-sheets arranged in the form of a sandwich, with a number of loops interconnecting the sheets as well as the strands within them.


4.2 Methods

4.2.1. The Superfamily Dataset Selection

An initial set of functionally characterized members of known structure in the concanavalin A-like lectins/glucanases superfamily (SCOP ID: 49899) was obtained from the PASS2 database (http://caps.ncbs.res.in/pass2) 21. Additional protein structures were selected using a Dali 22 search to obtain wider representation in each subgroup, in order to construct the consensus signatures. Finally, proteins were eliminated from the dataset so that no two proteins share sequence identity higher than 65%.
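The redundancy filter in the last step can be sketched as follows. This is a minimal illustration only: the text does not say how proteins were chosen for elimination, so the greedy "keep the first, drop later near-duplicates" rule, the function name `filter_by_identity`, and the pairwise identity values are all assumptions made for the example.

```python
def filter_by_identity(protein_ids, identity, cutoff=0.65):
    """Greedy redundancy filter (assumed strategy, not taken from the thesis).

    A protein is kept only if its sequence identity to every protein
    already kept does not exceed the cutoff.
    """
    kept = []
    for pid in protein_ids:
        if all(identity(pid, other) <= cutoff for other in kept):
            kept.append(pid)
    return kept


# Hypothetical pairwise sequence identities (fractions), for illustration only.
pairwise = {
    frozenset(("A", "B")): 0.90,
    frozenset(("A", "C")): 0.30,
    frozenset(("B", "C")): 0.28,
}


def identity(a, b):
    """Look up the precomputed identity for an unordered pair."""
    return pairwise[frozenset((a, b))]


selected = filter_by_identity(["A", "B", "C"], identity)
print(selected)
```

With these made-up values, protein B is dropped because it is 90% identical to A, while C is retained.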

The 3D coordinates of most proteins selected in the datasets were downloaded from the Protein Data Bank (http://www.rcsb.org) 1 and preprocessed by deleting duplicated chains and adding hydrogen atoms using the molecular modeling software suite YASARA 23 prior to the THEMATICS/POOL calculations. In the case of subgroup glycoside hydrolase-54, for which an additional structure with lower sequence identity (<60%) was needed, a comparative model structure was obtained. The target sequence (UniProt ID: G8SJ69) from Actinoplanes sp. 24 was obtained from UniProt 25. This sequence was then searched against the Protein Model Portal (http://www.proteinmodelportal.org/) 26, and the top-ranked model with the best reported quality was added to the subgroup. The final list of proteins used for characterizing the different functional subgroups is contained in Table 4-1.


Table 4-1. A list of proteins analyzed in the concanavalin A-like Lectins/Glucanases Superfamily. SG stands for structural genomics protein.

PDB | Name | Source | Subgroup
2ayh | 1,3-1,4-beta-D-glucan 4-glucanohydrolase | Hybrid Bacillus | GH16
3ilf | Porphyranase A | Zobellia galactanivorans | GH16
1dyp | Kappa-carrageenase | Pseudoalteromonas carrageenovora | GH16
2vy0 | Endo-beta-1,3-glucanases | Pyrococcus furiosus | GH16
1mve | 1,3-1,4-beta-D-glucanase | Fibrobacter succinogenes | GH16
1dy4 | Cellobiohydrolase Cel7A (equivalent to 1cel, 2cel, 8cel; mutant of these with E233S, A224H, L225V, T226A, D262G) | Trichoderma reesei | GH7
2rfw | Cellobiohydrolase | Melanocarpus albomyces | GH7
1z3t | Cellobiohydrolase Cel7D | Phanerochaete chrysosporium | GH7
1j1t | Alginate lyase | Alteromonas sp. 272 | lyase
1uai | Alginate lyase | Corynebacterium sp. | lyase
1vav | Alginate lyase | Pseudomonas aeruginosa | lyase
1h8v | Endo-beta-1,4-glucanase | Trichoderma reesei | GH12
2nlr | Endoglucanase CelB2 | Streptomyces lividans | GH12
1uu4 | Humicola grisea Cel12A | Aspergillus niger | GH12
1m4w | Thermophilic β-1,4-xylanase | Nonomuraea flexuosa | GH11
1h4g | Xylanase | Bacillus agaradhaerens | GH11
1bcx | Xylanase | Bacillus circulans | GH11
2ifr | Scytalidopepsin B | Scytalidium lignicola | peptidase
1y43:B | Aspergilloglutamic peptidase | Aspergillus niger | peptidase
1wd3 | Alpha-L-arabinofuranosidase B, N-terminal domain | Aspergillus kawachii | GH54
G8SJ69 (model) | Alpha-L-arabinofuranosidase B catalytic | Actinoplanes sp. (strain ATCC 31044 / CBS 674.73 / SE50/110) | GH54
3osd (SG) | Putative glycosyl hydrolase | Bacteroides thetaiotaomicron |
3nmb (SG) | Putative sugar hydrolase | Bacteroides ovatus |
3h3l (SG) | Putative glycoside hydrolase | Parabacteroides distasonis ATCC 8503 |
3hbk (SG) | Putative glycoside hydrolase | Parabacteroides distasonis ATCC 8503 |
3rq0 (SG) | A glycosyl hydrolase | Mycobacterium smegmatis str. MC2 155 |

4.2.2. SALSAs Methodology

The general scheme of the SALSAs methodology is shown in Figure 4-2. First, THEMATICS/POOL calculations are performed to determine the functionally important residues for each of the previously characterized members of the superfamily, with representation from each of the subgroups known to have enzymatic function. THEMATICS analysis was performed on the protein structures according to the procedures described previously 27, using a Z score cutoff value of 0.99 in the statistical analysis and a distance cutoff of 9.0 Å. For the POOL calculations 19, µ3, µ4, and the buffer range from THEMATICS, together with geometric features from the structure-only version of ConCavity 15, were used as input features. The top 8% of the POOL-ranked residues are taken as functionally important residues for present purposes.
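The 8% cutoff on the POOL ranking can be illustrated with a short sketch. The residue labels and scores below are hypothetical, the function name `top_ranked_residues` is invented for this example, and rounding the cutoff up to a whole residue is an assumption, since the text does not specify how fractional counts are handled.

```python
def top_ranked_residues(pool_scores, percent=8):
    """Return the residues in the top `percent` of a POOL-style ranking.

    pool_scores maps a residue label to its score (higher = more likely
    to be functionally important). The count is rounded up using integer
    arithmetic; the rounding rule is an assumption.
    """
    ranked = sorted(pool_scores, key=pool_scores.get, reverse=True)
    n_keep = max(1, (len(ranked) * percent + 99) // 100)
    return ranked[:n_keep]


# Hypothetical scores for a 25-residue toy protein (not real POOL output).
toy_scores = {f"R{i}": i / 25 for i in range(1, 26)}

important = top_ranked_residues(toy_scores)
print(important)
```

For this 25-residue toy input, 8% corresponds to two residues, so the two highest-scoring labels are returned.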

Second, these predicted local sites of activity are structurally aligned in order to identify the similarities and differences among the local functional sites of the different enzymatic subgroups. Three-dimensional structural alignments are performed using T-Coffee Expresso (3D Coffee) 28, Combinatorial Extension (CE) 29, and PDBeFold (http://www.ebi.ac.uk/msd-srv/ssm/ssmstart.html), in combination with a manual alignment. The manual alignment is sometimes necessary because the automated alignment programs can break down if the number of structures to align is large.

After alignment of the local active sites, these active sites are classified into subgroups based on active site similarity. The scoring matrix, described in the next subsection, is used in this classification process. Proteins with similar local active sites are grouped together, and the consensus signatures for each subgroup are generated as templates. For a query protein of unknown function, its local active site (predicted using POOL) is classified and analyzed by a search against the consensus signatures of each subgroup. The difference between the predicted set of residues for the query protein and the signature templates is quantified through scoring matrices.

For each subgroup, consensus signatures are then determined using majority vote. These consensus signatures represent the specific residue types in specific spatial positions that are common to a majority of the members of a given functional subgroup. In the present analysis, each of the biochemical functional subgroups has a unique signature that can serve as a local 3D template to compare to the predicted local active site of a protein with unknown function in the subsequent step. Next, THEMATICS/POOL is used to predict the local active site residues for the SG proteins of unknown function in the superfamily. These predicted sites are compared with the sets of consensus signatures to identify the best match and thus to attempt to determine the biochemical functional subgroup to which SG proteins should be assigned. The best match is determined by a scoring technique on the local set of functionally important residues.
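The majority-vote step can be sketched as below. This is an illustration under stated assumptions: the function name `consensus_signature` and the strict-majority rule are choices made for the example. The residue types are those listed for the three GH11 structures at aligned positions 1, 6, 7, and 8 of Table 4-2A, and all of them are treated as predicted residues because the boldface marking is not available here.

```python
from collections import Counter


def consensus_signature(columns, n_members, gap="-"):
    """Majority-vote consensus over aligned positions.

    columns maps an aligned position number to the residue types found at
    that position in each subgroup member. A residue type enters the
    signature only if more than half of the members share it (assumed rule).
    """
    signature = {}
    for position, residues in sorted(columns.items()):
        counts = Counter(r for r in residues if r != gap)
        if not counts:
            continue
        residue, votes = counts.most_common(1)[0]
        if 2 * votes > n_members:
            signature[position] = residue
    return signature


# Residue types for 1m4w, 1h4g, 1bcx at positions 1, 6, 7, 8 of Table 4-2A.
gh11_columns = {
    1: ["W", "W", "W"],
    6: ["N", "N", "N"],
    7: ["V", "L", "V"],
    8: ["A", "F", "V"],
}

gh11_signature = consensus_signature(gh11_columns, n_members=3)
print(gh11_signature)
```

Positions 1, 6, and 7 yield W, N, and V, matching the W-1, N-6, and V-7 entries of the hydrolase-11 signature in Table 4-2D, while position 8 has no majority and is left out.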


Figure 4-2. A schematic diagram of the SALSA method


4.2.3 The Scoring Matrix

To quantify the differences and similarities among the predicted local active sites or consensus signatures, matrices are used to score the aligned proteins against each other. Two types of matrices were used for this purpose: 1) BLOSUM62; and 2) a chemical similarity matrix (CSM) developed by us (see Table S1). For each protein pair, the following quantities are defined:

$$SS = \frac{\sum_{j=1}^{N} W_j}{N} \tag{1}$$

$$\text{Normalized match score} = \frac{SS_{\text{protein } A,B}}{SS_{\text{protein } A,A}} \tag{2}$$

where SS is the similarity score, Wj is the pairwise score between the predicted local sites of the two proteins at position Pj obtained from the scoring matrix, and N is the number of positions locally aligned. The normalized match score characterizes the relative similarity of the functionally important residues, and is defined as the SS of the template protein (or consensus signature) A to the query protein B, normalized by the SS of the template protein A with itself.

Thus, the normalized match score ranges from -1 to 1; a high positive value indicates high correlation between the active sites of the two proteins.
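Equations (1) and (2) can be illustrated with a small sketch. Only the six BLOSUM62 entries needed for the toy example are hard-coded. The function names, the two four-position sites, and the treatment of a gap as a score of 0 are assumptions made for illustration; the text does not state how gaps are scored.

```python
GAP = "-"

# A few BLOSUM62 entries (symmetric); only those needed for the toy example.
BLOSUM62_SUBSET = {
    ("W", "W"): 11, ("E", "E"): 5, ("Y", "Y"): 7,
    ("D", "D"): 6, ("D", "E"): 2, ("F", "Y"): 3,
}


def pair_score(a, b, gap_score=0):
    """W_j: matrix score for one aligned position (gap score is assumed)."""
    if a == GAP or b == GAP:
        return gap_score
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]


def similarity_score(site_a, site_b):
    """Equation (1): mean pairwise score over the N aligned positions."""
    if len(site_a) != len(site_b):
        raise ValueError("aligned sites must have the same length")
    return sum(pair_score(a, b) for a, b in zip(site_a, site_b)) / len(site_a)


def normalized_match_score(template, query):
    """Equation (2): SS(A, B) normalized by SS(A, A)."""
    return similarity_score(template, query) / similarity_score(template, template)


# Hypothetical aligned local sites, one residue type per aligned position.
template_site = ["W", "E", "Y", "D"]
query_site = ["W", "D", "F", "-"]

score = normalized_match_score(template_site, query_site)
print(round(score, 2))
```

Here SS(A, B) = (11 + 2 + 3 + 0) / 4 = 4.0 and SS(A, A) = (11 + 5 + 7 + 6) / 4 = 7.25, so the normalized match score is about 0.55. Because the normalization uses the template's self-score, the value is 1 when the query site is identical to the template.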

4.3. Results and Discussion

4.3.1 Implementation of SALSA on known Concanavalin A-like lectins/glucanases

The analysis of the concanavalin A-like lectins/glucanases serves as a test of the feasibility of the SALSA method. First, POOL was used to predict the local active site residues in the proteins of interest, since POOL has been validated as one of the top-performing functional residue predictors 30. One advantage of POOL is that, unlike THEMATICS, it is able to report both ionizable and non-ionizable residues. POOL can also utilize any input features upon which the probability of functional importance depends monotonically. Our previous work has shown that the success of POOL predictions is driven primarily by the THEMATICS input features; inclusion of geometric information about pocket size rank and of sequence conservation information each gives a small but statistically significant improvement in POOL performance 19. Considering the potential application to the annotation of the biochemical function of SG proteins, for which only structural information is available, we exclusively used protein structural information as input for the POOL calculations, i.e., the µ3, µ4, and buffer range metrics from THEMATICS and the surface geometry features from the structure-only version of ConCavity, although phylogenetic information (from sequence alignments) is an optional input feature in POOL. Thus, for purposes of application to the SALSA method, all of the input features for POOL are computed from the structure of the query protein alone.

Second, a 3D structural local active site alignment table is generated using the set of active site residues predicted by POOL, as shown in Table 4-2A. Tables are constructed with the vertical columns corresponding to protein structures and the rows corresponding to aligned spatial positions that are important for biochemical function in at least some of the members of the superfamily. If one or more proteins in the alignment has a residue in a particular position that has been predicted to be functionally important, then that position is included in the table. Structurally aligned positions are excluded from the table if they do not include any residues predicted to be functionally important. Thus the SALSA tables depict a 3D structural alignment at the local functional sites, and a 3D match between proteins at these sites is indicative of functional similarity.
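The row-selection rule for the SALSA table can be sketched as follows. The data structures, the function name `build_salsa_table`, and the protein and residue labels are hypothetical; the sketch only shows the inclusion rule described above, namely that an aligned position is kept when at least one protein has a predicted residue there.

```python
def build_salsa_table(alignment_rows, predicted):
    """Keep aligned positions where at least one residue is predicted.

    alignment_rows: list of dicts mapping protein ID -> residue label
                    (e.g. "E87") or None where the protein has a gap.
    predicted:      dict mapping protein ID -> set of residue labels
                    predicted to be functionally important.
    """
    table = []
    for row in alignment_rows:
        has_predicted = any(
            residue is not None and residue in predicted.get(protein, set())
            for protein, residue in row.items()
        )
        if has_predicted:
            table.append(row)
    return table


# Hypothetical structural alignment of two proteins (labels are made up).
rows = [
    {"P1": "E87", "P2": "E94"},    # E87 is predicted in P1 -> kept
    {"P1": "G10", "P2": None},     # nothing predicted -> dropped
    {"P1": None, "P2": "W101"},    # W101 is predicted in P2 -> kept
]
predicted_sites = {"P1": {"E87"}, "P2": {"W101"}}

salsa_table = build_salsa_table(rows, predicted_sites)
print(len(salsa_table))
```

A row is retained even when the predicted residue belongs to only one structure, which is what allows the table to expose positions that are important in some subgroups but not in others.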

Table 4-2A shows the 50 important residue positions (represented by rows) for the enzymes of known function in the superfamily. Boldfaced letters are the active site residues predicted by POOL, i.e., those ranked in the top 8% of residues by the POOL calculation. The residues highlighted in red correspond to the catalytic residues reported in the Catalytic Site Atlas (CSA) 31.

POOL succeeds in predicting the known catalytic residues in all of the enzymes of known function in the concanavalin A-like lectins/glucanases superfamily, in addition to other potentially important residues. The ability of POOL to predict all of the previously identified catalytic and ligand contact residues bodes well for extension to biochemical function prediction for the enzymes in this superfamily. A detailed local active site alignment reveals visually apparent differences in the aligned active sites for the members of the different subgroups in this superfamily. Even without prior knowledge about the functions of these proteins, a classification of protein subgroups can be achieved using the local active site alignment shown in Table 4-2A. The proteins with known biochemical functions have been sorted into six subgroups: glycoside hydrolases-11 (GH11), glycoside hydrolases-12 (GH12), glycoside hydrolases-7 (GH7), glycoside hydrolases-16 (GH16), lyase, and peptidase A4.

4.3.2 Scoring matrix quantification of the SALSA table

To further assess the differences and similarities between pairs of proteins in the superfamily, we used two matrices to quantify the similarities between the local active sites in Table 4-2A: BLOSUM62 32 and a chemical similarity matrix (CSM) developed in house (see Appendix). Table 4-2B presents the resulting normalized match scores for the protein structures in Table 4-2A using BLOSUM62 as the scoring matrix. The normalized match scores measure the degree of similarity between the local active site residues of the aligned proteins. Normalized match scores of 0.74 or higher are observed within the GH11, GH12, GH7, and peptidase A4 subgroups. Somewhat lower normalized match scores are observed within the lyase and GH16 subgroups, with values ranging from 0.41 to 0.55 and from 0.21 to 0.53, respectively. For proteins classified into different subgroups, normalized match scores are observed to be 0.20 or lower, and some are negative. A comparison of normalized match scores to the sequence identities of proteins in the subgroups shows that the normalized match scores, which represent the similarity of the functionally important local active sites, can be much higher than the sequence identities between these proteins. For instance, the sequence identities within the lyase subgroup are around 0.15-0.25 (Table 4-3), while the normalized match scores of these proteins are much higher (in the 0.41-0.55 range), reflecting the expected enrichment of similarity at the functionally important local active sites. The normalized match scores provide a clear distinction between the proteins in the different functional subgroups of the concanavalin A-like lectins/glucanases superfamily. Therefore, combining the SALSA method with scoring matrices affords an efficient way to obtain a pairwise quantification of the similarity of the local active site residues for proteins in our verification dataset. An improvement in normalized match score has been observed when applying the CSM for scoring the protein pairs, likely because the CSM considers the chemical similarity of residues, not just the identity of the residues at the local active sites. However, BLOSUM62 gives better distinction across the subgroups in terms of normalized match scores than the chemical similarity matrix (Table 4-2B vs. C). Thus, of the scoring systems studied, BLOSUM62 is regarded as the best scoring matrix for quantification of local active site similarity in the concanavalin A-like superfamily.
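The separation between subgroups can be checked directly from the score matrix. The sketch below uses only the GH11 and GH12 entries of Table 4-2B (BLOSUM62) and compares the lowest within-subgroup score to the highest between-subgroup score. The variable names and the choice of this min/max comparison as a separation measure are assumptions made for illustration.

```python
# Subgroup membership for the six structures used in this excerpt.
subgroup = {
    "1m4w": "GH11", "1h4g": "GH11", "1bcx": "GH11",
    "1uu4": "GH12", "1h8v": "GH12", "2nlr": "GH12",
}

# Normalized match scores from Table 4-2B (BLOSUM62), GH11/GH12 block only.
scores = {
    ("1m4w", "1h4g"): 0.86, ("1m4w", "1bcx"): 0.96, ("1h4g", "1bcx"): 0.85,
    ("1uu4", "1h8v"): 0.82, ("1uu4", "2nlr"): 0.74, ("1h8v", "2nlr"): 0.76,
    ("1m4w", "1uu4"): 0.18, ("1m4w", "1h8v"): 0.17, ("1m4w", "2nlr"): 0.19,
    ("1h4g", "1uu4"): 0.17, ("1h4g", "1h8v"): 0.15, ("1h4g", "2nlr"): 0.18,
    ("1bcx", "1uu4"): 0.16, ("1bcx", "1h8v"): 0.15, ("1bcx", "2nlr"): 0.18,
}

# Split the pairs by whether both proteins belong to the same subgroup.
within = [s for (a, b), s in scores.items() if subgroup[a] == subgroup[b]]
between = [s for (a, b), s in scores.items() if subgroup[a] != subgroup[b]]

lowest_within = min(within)
highest_between = max(between)
print(lowest_within, highest_between)
```

For this block, the lowest within-subgroup score (0.74) lies well above the highest between-subgroup score (0.19), which is the kind of gap the text refers to as a clear distinction between subgroups.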


Table 4-2. (A) The 3D structural alignment of the predicted local active site residues for proteins in the concanavalin A-like lectins/glucanases superfamily. Each column represents a protein of interest and each row represents a structurally aligned spatial position. The residues colored in red correspond to the catalytic residues reported in the Catalytic Site Atlas (CSA). The residues in boldface are the functionally important residues as predicted by POOL using an 8% cutoff (GH stands for glycoside hydrolase). The residues highlighted in blue are the members of the proposed consensus signature of the predicted local active sites for subgroups in the concanavalin A-like lectins/glucanases superfamily.

Subgroups (columns, left to right): hydrolase-11 (1m4w, 1h4g, 1bcx); hydrolase-12 (1uu4, 1h8v, 2nlr); hydrolase-7 (1z3t, 1dy4, 2rfw); hydrolase-16 (2ayh, 1dyp, 3ilf, 2vy0, 1mve); lyase (1uai, 1j1t, 1vav); peptidase-4 (2ifr, 1y43)

# 1m4w 1h4g 1bcx 1uu4 1h8v 2nlr 1z3t 1dy4 2rfw 2ayh 1dyp 3ilf 2vy0 1mve 1uai 1j1t 1vav 2ifr 1y43
1 W20 W19 W9 N22 N20 N22 G23 G23 G23 E8 D47 D37 D52 - T19 E21 T10 S4 -
2 D22 D21 D11 W24 W22 W24 C25 C25 C25 D22 T61 D51 K68 - - N24 P12 - -
3 - - - - - - - - - S25 - P54 N70 - - - - - -
4 - - - - - - - - - N26 - W56 G71 - T65 T45 T52 - -
5 - - - - - - - - - W34 W69 F64 Y87 - - - - - -
6 N46 N45 N35 K62 K58 K55 S103 A106 A106 C61 E62 Y52 S120 G2 R72 R52 R59 G7 -
7 V48 L47 V37 Y64 Y60 Y57 R104 R107 R107 E63 V114 A102 R122 E4 S73 H53 S60 - -
8 A49 F48 V38 P65 Q61 P58 V105 L108 F108 Y64 A115 A103 L123 L5 E74 E54 E61 - -
9 G50 R49 G39 Y66 N62 S59 L107 L110 L110 S66 S117 S105 T125 T7 R76 K56 R63 A9 -
10 N72 N79 N63 R97 R93 V98 P137 P134 E137 V88 V139 S128 W150 - - - - - -
11 Y74 Y81 Y65 N99 N95 N100 G139 G142 S142 S90 A142 F130 A152 S31 V114 I87 I99 T37 S8
12 L77 V84 L68 Y102 Y98 Y103 Y142 Y145 Y145 F92 W144 W131 W154 F40 Q117 Q90 Q102 W39 W10
13 Y78 Y85 Y69 D103 D99 D104 L143 F146 F146 Y94 Y146 R133 L156 Y42 H119 H92 H104 G41 G12
14 W‐14 W87 W71 F105 F101 W106 A145 S148 A148 T95 S147 V134 G157 Q53 D120 A93 S105 D43 D14
15 - - - - - - Y168 Y171 Y171 - - - - - - - - - -
16 - - - - - - D170 D173 D173 - - - - - - - - - -
17 - - - - - - S171 S174 A174 - - - - - - - - - -
18 E87 E94 E78 E120 E116 E120 Q172 Q175 Q175 W103 Y161 F137 C168 W54 - - - - -
19 Y88 Y95 Y79 L121 L117 I121 E207 E212 E212 D104 S162 S138 G169 V55 D123 G97 A114 T49 T20
20 Y89 Y96 Y80 M122 M118 M122 M208 M213 M217 E105 E163 E139 E170 E56 D124 T98 P115 A50 -
21 I90 I97 V81 I123 I119 I123 D209 D214 D214 I106 I164 I140 I171 V57 V125 I99 L116 I51 -
22 - - - - - - I210 I215 V215 D107 D165 D141 D172 D58 T126 S100 V117 L52 -
23 W94 W101 W85 R127 K123 R127 W211 W216 W216 I108 - - - - V127 K101 K118 Q53 Q24
24 - - - - - - E212 E217 E217 E109 E168 E144 E175 E60 R129 Y103 L120 G55 G26
25 G95 G102 G86 Y128 Y124 V128 - - - L111 - - L177 - - - - - -
26 - - - - - - N215 S220 A220 G112 Q171 G147 G178 G63 T133 F122 R124 D57 D28
27 - - - - - - - - - - K172 - - - R129 D65 D39
28 - - - - - - - - - - - - - - - - - W67 W41
29 R98 R105 R89 Y132 G128 Q132 A217 S222 A222 K117 E177 Q161 T183 S68 D141 G133 D140 E69 E43
30 P99 P106 P90 P133 P129 P133 A219 A224 A224 Q119 D179 H163 H185 Q70 T143 E136 R144 Y71 Y45
31 T100 P107 T91 I134 I130 L134 T221 T226 T226 N121 D181 N165 H189 N72 H145 K137 A145 P72 P46
32 R121 R129 R112 N155 N151 N158 H223 H228 H228 Y123 H183 H167 Y193 I74 - - - - -
33 P125 P133 P116 - - - - - - - - - - - - - - - -
34 S126 S134 S117 - - - C238 C243 C243 - - - - - - - - - -
35 I127 I135 I118 - - - A239 G244 G244 - - - - - - - - - -
36 - - - - - - R240 R251 R251 - - - - - - - - E136 E110
37 F133 F141 F125 M158 M154 N158 - - - - - - - Q81 - - - D137 D111
38 Q135 Q143 Q127 V160 V156 V160 D246 D257 D257 V127 M195 L175 I199 K82 - - - F138 F112
39 W137 W145 W129 S162 S158 S162 D248 D259 N259 G128 R196 Q176 T200 T83 - - - - -
40 S138 S146 S130 F163 F159 F163 C250 C261 C261 G129 P197 P177 R201 - - - - - -
41 - - - - - - D251 D262 D262 H130 N204 L178 A202 S84 - - - - -
42 S146 A180 V168 D172 N167 G171 R256 R267 R267 E131 H205 G179 T200 E85 - - - - -
43 T150 S158 T145 D176 D171 D175 T276 V290 V288 Y147 Y221 Y195 F219 Y101 E159 N150 S159 - -
44 - - - - - - Q280 Q293 R291 W151 V225 W199 W223 W105 Y189 Y180 Y193 - -
45 I172 A180 V168 Q201 Q196 Q199 V358 V361 V360 K178 N252 R228 Y259 H99 - - - - -
46 E176 E184 E172 E205 E200 E203 S362 S365 S364 I179 L253 M229 I260 L136 K191 K182 K195 - -
47 Y178 Y186 Y174 F207 P201 P204 W364 W367 W366 N182 S256 D232 N263 N139 Y195 Y186 Y199 - -
48 - - - - - - D366 D369 D368 W184 G258 E234 A265 W141 Q197 Q188 Q201 - -
49 - - - - - - - - - G186 R260 F236 G267 S143 C200 F211 R204 - -


(B) Normalized match scores of the aligned local active sites for the proteins in (A) obtained with the BLOSUM62 scoring matrix. Blocks along the diagonal with high scores are highlighted in color.

Subgroup PDB 1m4w 1h4g 1bcx 1uu4 1h8v 2nlr 1z3t 1dy4 2rfw 2ayh 1dyp 3ilf 2vy0 1mve 1uai 1j1t 1vav 2ifr 1y43
GH11 1m4w 1 0.86 0.96 0.18 0.17 0.19 0 -0.02 0.01 -0.09 -0.03 -0.06 -0.13 -0.02 -0.02 0 0.01 -0.01 -0.02
GH11 1h4g 0.86 1 0.85 0.17 0.15 0.18 -0.05 -0.05 0.01 -0.09 -0.07 -0.09 -0.13 -0.03 0 0 0.06 0.02 0.01
GH11 1bcx 0.96 0.85 1 0.16 0.15 0.18 0.01 -0.02 0.01 -0.1 -0.06 -0.09 -0.13 -0.02 -0.02 -0.02 0 -0.03 -0.03
GH12 1uu4 0.18 0.17 0.16 1 0.82 0.74 -0.02 -0.04 0 -0.08 -0.03 -0.07 -0.05 0 0.04 0.08 0.04 0.03 0.03
GH12 1h8v 0.17 0.15 0.15 0.82 1 0.76 -0.04 -0.05 -0.01 -0.07 0 -0.05 -0.03 0.02 0.07 0.14 0.06 0.04 0.03
GH12 2nlr 0.19 0.18 0.18 0.74 0.76 1 -0.04 -0.06 -0.03 -0.02 0.04 0 -0.03 0.04 0.05 0.11 0.05 0.1 0.09
GH7 1z3t 0 -0.05 0.01 -0.02 -0.04 -0.04 1 0.89 0.79 -0.11 -0.04 -0.11 -0.07 -0.02 -0.09 -0.1 -0.07 -0.06 -0.06
GH7 1dy4 -0.02 -0.05 -0.02 -0.04 -0.05 -0.06 0.89 1 0.85 -0.09 -0.03 -0.13 -0.09 0.02 -0.08 -0.11 -0.08 -0.07 -0.06
GH7 2rfw 0.01 0.01 0.01 0 -0.01 -0.03 0.79 0.85 1 -0.06 -0.03 -0.11 -0.09 0.01 -0.08 -0.08 -0.08 -0.08 -0.05
GH16 2ayh -0.09 -0.09 -0.1 -0.08 -0.07 -0.02 -0.11 -0.09 -0.06 1 0.29 0.29 0.34 0.45 0.11 0.1 -0.04 -0.11 -0.1
GH16 1dyp -0.03 -0.07 -0.06 -0.03 0 0.04 -0.04 -0.03 -0.03 0.29 1 0.37 0.23 0.26 0.08 0.11 0.07 0.08 0.07
GH16 3ilf -0.06 -0.09 -0.09 -0.07 -0.05 0 -0.11 -0.13 -0.11 0.29 0.37 1 0.32 0.23 0.07 0.08 0.01 0.01 -0.01
GH16 2vy0 -0.13 -0.13 -0.13 -0.05 -0.03 -0.03 -0.07 -0.09 -0.09 0.34 0.23 0.32 1 0.3 0.07 0.04 -0.04 -0.01 -0.02
GH16 1mve -0.02 -0.03 -0.02 0 0.02 0.04 -0.02 0.02 0.01 0.45 0.26 0.23 0.3 1 0.18 0.12 0.06 0.09 0.08
GH16 3rq0 -0.08 -0.12 -0.08 -0.02 0.01 -0.02 -0.04 -0.05 -0.06 0.25 0.21 0.23 0.53 0.21 -0.02 0 -0.05 -0.03 -0.03
lyase 1uai -0.02 0 -0.02 0.04 0.07 0.05 -0.09 -0.08 -0.08 0.11 0.08 0.07 0.07 0.18 1 0.55 0.41 0.2 0.19
lyase 1j1t 0 0 -0.02 0.08 0.14 0.11 -0.1 -0.11 -0.08 0.1 0.11 0.08 0.04 0.12 0.55 1 0.41 0.13 0.12
lyase 1vav 0.01 0.06 0 0.04 0.06 0.05 -0.07 -0.08 -0.08 -0.04 0.07 0.01 -0.04 0.06 0.41 0.41 1 0.18 0.15
peptidase 2ifr -0.01 0.02 -0.03 0.03 0.04 0.1 -0.06 -0.07 -0.08 -0.11 0.08 0.01 -0.01 0.09 0.2 0.13 0.18 1 0.9
peptidase 1y43 -0.02 0.01 -0.03 0.03 0.03 0.09 -0.06 -0.06 -0.05 -0.1 0.07 -0.01 -0.02 0.08 0.19 0.12 0.15 0.9 1


(C) Normalized match scores of the aligned local active sites for the proteins in (A) obtained with the Chemical Similarity Matrix (CSM) as the scoring matrix. Blocks along the diagonal with high scores are highlighted in color.

Subgroup PDB 1m4w 1h4g 1bcx 1uu4 1h8v 2nlr 1z3t 1dy4 2rfw 2ayh 1dyp 3ilf 2vy0 1mve 1uai 1j1t 1vav 2ifr 1y43
GH11 1m4w 1 0.84 0.95 0.24 0.21 0.2 -0.01 -0.02 0.02 -0.09 -0.04 -0.05 -0.14 -0.03 -0.02 0 0 -0.01 -0.02
GH11 1h4g 0.84 1 0.83 0.22 0.2 0.18 -0.05 -0.05 0.02 -0.07 -0.08 -0.09 -0.14 -0.03 0.01 0.02 0.07 0.02 0.01
GH11 1bcx 0.95 0.83 1 0.22 0.19 0.18 0 -0.01 0.02 -0.1 -0.07 -0.09 -0.14 -0.02 -0.02 -0.02 0 -0.03 -0.03
GH12 1uu4 0.24 0.22 0.22 1 0.83 0.76 -0.01 -0.02 0.02 -0.05 0.01 -0.06 -0.04 0.03 0.07 0.11 0.08 0.05 0.05
GH12 1h8v 0.21 0.2 0.19 0.83 1 0.78 -0.04 -0.05 -0.01 -0.05 0.03 -0.04 -0.02 0.03 0.08 0.16 0.08 0.05 0.05
GH12 2nlr 0.2 0.18 0.18 0.76 0.78 1 -0.05 -0.06 -0.03 -0.01 0.08 0.02 -0.01 0.07 0.06 0.14 0.08 0.12 0.11
GH7 1z3t -0.01 -0.05 0 -0.01 -0.04 -0.05 1 0.88 0.78 -0.08 0 -0.07 -0.05 0 -0.07 -0.1 -0.05 -0.05 -0.05
GH7 1dy4 -0.02 -0.05 -0.01 -0.02 -0.05 -0.06 0.88 1 0.85 -0.05 0 -0.1 -0.07 0.04 -0.07 -0.11 -0.06 -0.06 -0.04
GH7 2rfw 0.02 0.02 0.02 0.02 -0.01 -0.03 0.78 0.85 1 0 0 -0.06 -0.07 0.04 -0.06 -0.09 -0.06 -0.06 -0.04
GH16 2ayh -0.09 -0.07 -0.1 -0.05 -0.05 -0.01 -0.08 -0.05 0 1 0.34 0.38 0.41 0.43 0.15 0.15 -0.02 -0.09 -0.09
GH16 1dyp -0.04 -0.08 -0.07 0.01 0.03 0.08 0 0 0 0.34 1 0.4 0.25 0.3 0.1 0.13 0.1 0.07 0.05
GH16 3ilf -0.05 -0.09 -0.09 -0.06 -0.04 0.02 -0.07 -0.1 -0.06 0.38 0.4 1 0.33 0.26 0.09 0.1 0.03 -0.01 -0.03
GH16 2vy0 -0.14 -0.14 -0.14 -0.04 -0.02 -0.01 -0.05 -0.07 -0.07 0.41 0.25 0.33 1 0.33 0.12 0.09 -0.02 -0.03 -0.04
GH16 1mve -0.03 -0.03 -0.02 0.03 0.03 0.07 0 0.04 0.04 0.43 0.3 0.26 0.33 1 0.23 0.17 0.08 0.11 0.11
GH16 3rq0 -0.08 -0.12 -0.08 0.02 0.05 0.02 0 0 -0.01 0.32 0.23 0.24 0.55 0.26 0.03 0.05 -0.04 -0.05 -0.05
lyase 1uai -0.02 0.01 -0.02 0.07 0.08 0.06 -0.07 -0.07 -0.06 0.15 0.1 0.09 0.12 0.23 1 0.56 0.41 0.2 0.18
lyase 1j1t 0 0.02 -0.02 0.11 0.16 0.14 -0.1 -0.11 -0.09 0.15 0.13 0.1 0.09 0.17 0.56 1 0.45 0.13 0.12
lyase 1vav 0 0.07 0 0.08 0.08 0.08 -0.05 -0.06 -0.06 -0.02 0.1 0.03 -0.02 0.08 0.41 0.45 1 0.19 0.17
peptidase 2ifr -0.01 0.02 -0.03 0.05 0.05 0.12 -0.05 -0.06 -0.06 -0.09 0.07 -0.01 -0.03 0.11 0.2 0.13 0.19 1 0.9
peptidase 1y43 -0.02 0.01 -0.03 0.05 0.05 0.11 -0.05 -0.04 -0.04 -0.09 0.05 -0.03 -0.04 0.11 0.18 0.12 0.17 0.9 1

(D) A list of the consensus signatures for each subgroup based on (A), presented in the format R-N, where R is the one-letter code of the consensus amino acid and N is the number of the spatial position listed in (A).

#  Subgroup      Consensus signatures
1  hydrolase-11  W-1 N-6 V-7 Y-11 Y-13 W-14 E-18 Y-20 R-32 P-33 S-34 Q-38 W-39 E-46
2  hydrolase-12  N-1 W-2 Y-7 D-13 F-14 E-18 M-20 N-32 M-37 Q-45 E-46
3  hydrolase-7   R-7 Y-12 Y-15 D-16 S-17 Q-18 E-19 D-21 E-24 H-32 R-36 D-39 D-41 R-42 W-47 D-48
4  hydrolase-16  F-12 Y-13 W-18 E-20 D-22 E-24 L-25 Q-30 N-31 Y-32 Y-43
5  lyase         T-4 R-6 E-8 Q-12 H-13 K-23 Y-44 K-46 Y-47 Q-48
6  peptidase-4   T-11 W-12 G-13 D-14 Q-23 G-24 D-26 D-27 W-28 E-29 P-31 E-36 D-37 F-38

Table 4-3. Local active site alignment of one crystal structure of Alpha-L-arabinofuranosidase B (GH54, PDB ID: 1wd3) from Aspergillus kawachii and one comparative model of Alpha-L-arabinofuranosidase B from Actinoplanes sp. (UniProt ID: G8SJ69, strain ATCC 31044 / CBS 674.73 / SE50/110). The residues in boldface are the functionally important residues as predicted by POOL using an 8% cutoff. The consensus signatures of the GH54 subgroup have been proposed to be C-1, C-2, D-3, E-4, D-5, M-6, D-7, E-8, and D-9 in the table and are highlighted in blue.

PDB ID  Active site alignment
Consensus signature positions: 1 2 3 4 5 6 7 8 9
1wd3    C176  C177  D179  E184  D189  m195  E196  Y199  W206  D219  E221  N222  N223  L224  G295  D297  F306  E308
G8SJ69  C203  C204  D206  E211  D216  M222  E223  Y226  W233  D246  E248  N249  N250  L251  G324  D326  F335  E337

4.3.3 Implementation of Consensus Signatures

To establish templates for characterizing the active sites of each functional subgroup of enzymes in the superfamily, consensus signatures were identified for each subgroup. The consensus signatures were defined by majority vote. All proposed consensus signatures are shown in boldface in Table 4-2A and D. A detailed application of the consensus signatures to each enzymatic subgroup within the Concanavalin A-like lectin/glucanases superfamily is presented in the following subsections.
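The majority-vote rule can be illustrated with a minimal Python sketch. The function name consensus_signature, the strict-majority threshold of 50%, and the example residues are illustrative assumptions and not the implementation used in this work; the input is assumed to be a list of structurally aligned active-site rows, one per protein, with "-" marking an unoccupied position.

```python
from collections import Counter

def consensus_signature(aligned_sites, min_fraction=0.5):
    """Return {position: residue} for positions where one residue type
    is present in more than min_fraction of the aligned proteins.
    The threshold is an assumption chosen for illustration."""
    n_proteins = len(aligned_sites)
    n_positions = len(aligned_sites[0])
    signature = {}
    for pos in range(n_positions):
        column = [site[pos] for site in aligned_sites]
        counts = Counter(res for res in column if res != "-")
        if not counts:
            continue  # no protein has a predicted residue here
        residue, count = counts.most_common(1)[0]
        if count / n_proteins > min_fraction:
            signature[pos + 1] = residue  # positions are 1-based, as in Table 4-2
    return signature

# Hypothetical aligned active-site residues (one-letter codes, "-" = gap)
example = [
    ["W", "N", "E", "Y", "-"],
    ["W", "N", "E", "F", "R"],
    ["W", "D", "E", "Y", "-"],
]
print(consensus_signature(example))
```

In this hypothetical example, position 5 is occupied in only one of the three proteins and therefore does not enter the signature.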

4.3.3.1 Glycoside Hydrolase

In our dataset, the majority of the proteins belong to one of the glycoside hydrolase subgroups, including GH11, GH12, GH16, GH7, and GH54. The classification of the previously characterized proteins into these subgroups using the SALSA method corresponds well with the reported classification in the Carbohydrate Active enZYmes (CAZy) database (http://www.cazy.org/)34. These glycoside hydrolases, also called glycosidases, hydrolyze the glycosidic bond between two or more carbohydrates, as seen in Figure 4-3A. They function as catalysts for the degradation of biomass, in anti-bacterial defense strategies, and in pathogenesis mechanisms33.

The mechanisms of glycoside hydrolysis can be classified into one of two categories, inverting and retaining. Briefly, the inverting mechanism is a one-step, single-displacement reaction involving oxocarbenium ion-like transition states. The reaction occurs with acid/base assistance from two amino acid side chains, normally glutamic or aspartic acid side chains, that are located 6-11 Å apart35 (see Figure 4-3B). A classical retaining mechanism proceeds via a two-step, double-displacement reaction through a covalent glycosyl-enzyme intermediate. The reaction occurs with acid/base and nucleophilic assistance provided by glutamate or aspartate side chains located around 5.5 Å apart. In the first step, a nucleophilic residue attacks the anomeric center to displace the aglycon and form a glycosyl-enzyme intermediate. At the same time, the other residue functions as an acid catalyst and protonates the glycosidic oxygen atom as the bond cleaves. In the second step, the glycosyl-enzyme intermediate is hydrolyzed by water, with the other acidic residue now acting as a base catalyst that deprotonates the water molecule as it attacks. Each step passes through an oxocarbenium ion-like transition state (Figure 4-3C)35.
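As a rough illustration of the distance criterion described above, the following Python sketch labels a pair of catalytic carboxylate residues by their separation. The cutoffs (below 6 Å for retaining-like, 6-11 Å for inverting-like) are simplifying assumptions taken from the approximate values quoted in the text, and the function names are hypothetical; the sketch is a heuristic, not a mechanistic assignment.

```python
import math

def residue_distance(xyz_a, xyz_b):
    """Euclidean distance in Å between two representative atoms."""
    return math.dist(xyz_a, xyz_b)

def suggest_mechanism(distance_angstrom):
    """Heuristic label from the separation of the two catalytic residues.
    Cutoffs are assumptions based on the ~5.5 Å and 6-11 Å values in the text."""
    if distance_angstrom < 6.0:
        return "retaining-like"
    if distance_angstrom <= 11.0:
        return "inverting-like"
    return "unclassified"

# Hypothetical coordinates for two carboxylate carbons
print(suggest_mechanism(residue_distance((0.0, 0.0, 0.0), (3.0, 4.0, 0.0))))
```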

All five GH subgroups identified so far in the Concanavalin A-like lectin/glucanases superfamily react via a retaining mechanism.

A comparison of the consensus signatures for subgroups GH11 and GH12 (Table 4-2A) reveals that the catalytic residues of the proteins in these two subgroups overlap spatially, as shown in Figure 4-4. Indeed, in the SCOP database, these two subgroups are catalogued as a single subgroup. Although the catalytic residues in the two subgroups are identical and spatially overlapped, the additional local active site residues reported by POOL reveal that these two subgroups have somewhat different active sites. The consensus signatures for the two subgroups based on Table 4-2A are presented in the format R-N, where R is the one-letter code of the residue (amino acid) and N is the number of the spatial position in Table 4-2A. For example, W-1 stands for Trp at position 1 and N-48 stands for Asn at position 48. In GH11, 14 key positions define the consensus signature, as shown in Table 4-2D: W-1, N-6, V-7, Y-11, Y-13, W-14, E-18, Y-20, R-32, P-33, S-34, Q-38, W-39, and E-46. In GH12, 11 key positions define the consensus signature: N-1, W-2, Y-7, D-13, F-14, E-18, M-20, N-32, M-37, Q-45, and E-46. The catalytic residues for GH11 and GH12 overlap spatially and the general enzyme mechanisms are quite similar. The differences between the consensus signatures, seen in residues other than the catalytic residues (see Figure 4-4), mostly correspond to ligand contact residues, which control the specificity for different substrates. It is known that the GH11 and GH12 proteins in the superfamily favor the catalysis of xylans20b and beta-glucans36, respectively. Therefore, the SALSA method is able to distinguish between the GH11 and GH12 subgroups, as it provides useful information not only about the catalytically active residues, but also about the binding residues in the pocket.
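The comparison of the two signatures can be reproduced with a short Python sketch that parses the R-N notation and separates shared spatial positions into those with identical and those with different residues. The signature strings are copied from Table 4-2D; the function names are illustrative only.

```python
GH11 = "W-1 N-6 V-7 Y-11 Y-13 W-14 E-18 Y-20 R-32 P-33 S-34 Q-38 W-39 E-46"
GH12 = "N-1 W-2 Y-7 D-13 F-14 E-18 M-20 N-32 M-37 Q-45 E-46"

def parse_signature(text):
    """Turn 'W-1 N-6 ...' into {1: 'W', 6: 'N', ...}."""
    signature = {}
    for token in text.split():
        residue, position = token.split("-")
        signature[int(position)] = residue
    return signature

def compare_signatures(sig_a, sig_b):
    """Split the shared spatial positions into identical and differing residues."""
    shared = sorted(set(sig_a) & set(sig_b))
    identical = {p: sig_a[p] for p in shared if sig_a[p] == sig_b[p]}
    different = {p: (sig_a[p], sig_b[p]) for p in shared if sig_a[p] != sig_b[p]}
    return identical, different

identical, different = compare_signatures(parse_signature(GH11), parse_signature(GH12))
print(identical)
print(different)
```

For these two signatures, the only shared positions carrying the same residue are E-18 and E-46, which is consistent with the shared catalytic residues described above, while the six remaining shared positions carry different residues.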

The lack of a common substrate specificity may explain why normalized match scores in the GH16 subgroup are significantly lower than those of the other subgroups. The proteins in this subgroup use the same catalytic residues and hydrolyze polysaccharides with a similar catalytic mechanism; thus these proteins have been classified into one subgroup. This classification is consistent with the SCOP classification. However, the proteins in the subgroup have diverse substrate specificities; they include cellobiohydrolase, kappa-carrageenase, porphyranase A, and endo-beta-1,3-glucanases. The consensus signature residues in this subgroup are not as conserved as those of other subgroups. It is expected that the GH16 subgroup could be further divided into more substrate-specific subgroups, with improved normalized match scores, when more crystal structures of representative proteins that catalyze reactions on different substrates become available. Therefore, the POOL-SALSA method can afford useful information beyond just the catalytically active residues, including insight into the recognition residues in the local binding pocket.

4.3.3.2 Alginate lyase

Another functional subgroup consists of the lyases (as shown in Table 4-2). These proteins are among the predominant enzymes for depolymerization of polysaccharide chains. In contrast to hydrolases, lyases cleave the glycosidic bond via β-elimination37. The reaction acts on a sugar that bears an acidic group next to the carbon atom forming the glycosidic bond, and it results in the formation of a reducing end on one fragment and an unsaturated ring at the non-reducing end of the second fragment. Three lyases with sequence identity lower than 25% were examined and used as templates to generate a consensus signature consisting of ten residues. The consensus signature of this subgroup is T-4, R-6, E-8, Q-12, H-13, K-23, Y-44, K-46, Y-47, and Q-48, based on Table 4-2. The normalized match scores measuring the similarity of the local active sites are substantially higher than the corresponding sequence identity scores (see Table 4-5). Normalized match scores of 0.41-0.55 with BLOSUM62 and 0.41-0.56 with the CMS matrix are observed, again reflecting the expected enrichment of similarity among the active site residues in the local active sites.

4.3.3.3 Peptidase A4

Another subgroup in the superfamily is the glutamic peptidases, e.g. scytalidoglutamic peptidase (SGP, PDB ID: 2ifr) from Scytalidium lignicolum. Two residues, Gln53 and Glu136, have been determined to be catalytically important using site-directed mutagenesis. Glu136 works as a general base to activate water for attack on the carbonyl carbon to form a tetrahedral intermediate. Gln53 provides electrophilic assistance and oxyanion stabilization. The 3D active site alignment of the only two available non-redundant crystal structures reveals that the catalytic and binding residues of scytalidoglutamic peptidase are almost identical to those of aspergilloglutamic peptidase (PDB ID: 1y43), as shown in Table 4-2. The consensus signature residues for this subgroup are now defined as T-11, W-12, G-13, D-14, Q-23, G-24, D-26, D-27, W-28, E-29, P-31, E-36, D-37, and F-38.

At this stage, it is not certain how many non-redundant 3D structures are required to define the consensus signature for a subgroup in an optimal way, but the number appears to depend on the variability of the known structures. In most cases, structures were selected so that no two proteins share sequence identity greater than or equal to 60%. Structures with experimentally validated biochemical function were also sought for generating the 3D functional templates at the active site, to provide distinct signatures for functional annotation. One exception is the glycoside hydrolase-54 (GH54) subgroup, for which only two highly similar crystal structures (PDB IDs: 1wd3 and 2d43), with sequence identity >90%, are available in the PDB. With the help of homology modeling, one more 3D structure within this subgroup was obtained, that of an Alpha-L-arabinofuranosidase B catalytic from Actinoplanes sp. (strain ATCC 31044 / CBS 674.73 / SE50/110). It has a sequence identity of 52% to the template protein, Alpha-L-arabinofuranosidase B from Aspergillus kawachii (PDB ID: 1wd3). Plausible consensus signatures for the GH54 subgroup are proposed on the basis of these two structures (one crystal structure and one homology model), as shown in Table 4-3. These two proteins were not included in the overall SALSA alignment table (Table 4-2), but the consensus signature of this subgroup has been used as a template for functional annotation of the SG proteins.
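The 60% redundancy criterion described above can be sketched as a simple greedy filter in Python. The greedy order, the function name select_nonredundant, and the dictionary layout are assumptions made for illustration; the identity of 0.52 between the model and 1wd3 is taken from the text, whereas 0.91 stands in for the reported ">90%" and the third value is hypothetical.

```python
def select_nonredundant(candidates, identity, cutoff=0.60):
    """Greedily keep structures so that no kept pair has identity >= cutoff.
    Candidates are visited in the order given (an assumed priority order)."""
    selected = []
    for candidate in candidates:
        if all(identity[frozenset((candidate, kept))] < cutoff for kept in selected):
            selected.append(candidate)
    return selected

# Pairwise sequence identities as fractions
identity = {
    frozenset(("1wd3", "2d43")): 0.91,          # placeholder for the reported ">90%"
    frozenset(("1wd3", "G8SJ69_model")): 0.52,  # reported in the text
    frozenset(("2d43", "G8SJ69_model")): 0.52,  # hypothetical value
}

print(select_nonredundant(["1wd3", "2d43", "G8SJ69_model"], identity))
```

Under these assumed values the filter keeps 1wd3 and the comparative model, mirroring the two structures used for the GH54 signature in Table 4-3.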

Figure 4-3. (A) The general mechanism of glycoside hydrolases (GH); (B) the inverting mechanism for GH; (C) the retaining mechanism for GH. Illustrations modified from Ref. 33.

Figure 4-4. (A) Structural alignment of three xylanases from subgroup GH11 (PDB IDs: 1m4w (white), 1h4g (magenta) and 1bcx (cyan)); (B) local alignment of POOL-predicted active site residues for these three proteins; (C) structural alignment of endoglucanases from subgroup GH12 (PDB IDs: 1uu4 (white), 1h8v (blue) and 2nlr (magenta)); (D) local alignment of POOL-predicted active site residues for these three proteins.

4.3.4 Biochemical functional annotation of structural genomics (SG) proteins

After the consensus signatures were established for all subgroups, the predicted set of residues for each SG protein in Table 4-1 was aligned and scored against the consensus signatures of all subgroups. Detailed results are summarized in Table 4-5. Table 4-5A shows the normalized match scores for each of the protein structures against the consensus signatures of each biochemical functional subgroup in the Concanavalin A-like lectins/glucanases superfamily. For most subgroups, the majority of the residues in the consensus signatures are conserved across the proteins in each subgroup. There are typically only one or two spatial positions where the observed residues vary from the established consensus signatures. With the exception of the GH16 subgroup, all normalized match scores between the previously characterized proteins and the consensus signatures of their own subgroups are 0.82 or higher. The normalized match scores are significantly higher than the corresponding sequence identity scores between proteins in the same subgroup (see Table 4-5B). In all cases, normalized match scores are significantly lower when the previously characterized proteins are scored against the consensus signatures of subgroups other than their own. This evidence supports the hypothesis that the normalized match scores afford good discrimination among the functional subgroups of the superfamily.
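To make the idea of a normalized match score concrete, the following Python sketch scores a query's aligned residues against a consensus signature and divides by the self-score of the consensus, so that a perfect match gives 1. This normalization, the gap and mismatch penalties, and the small hand-entered substitution table are all assumptions for illustration; they are not the scoring scheme or matrix values used in this work, and a full substitution matrix such as BLOSUM62 would replace the toy table in practice.

```python
# Toy substitution scores for a few residue pairs (illustrative values only)
SCORES = {
    ("E", "E"): 5, ("D", "D"): 6, ("W", "W"): 11, ("Y", "Y"): 7,
    ("F", "F"): 6, ("E", "D"): 2, ("Y", "F"): 3, ("W", "F"): 1,
}
MISMATCH = -2  # assumed score for pairs missing from the toy table
GAP = -4       # assumed penalty when either position is unoccupied

def pair_score(a, b):
    """Symmetric lookup in the toy table, with assumed gap/mismatch defaults."""
    if a == "-" or b == "-":
        return GAP
    return SCORES.get((a, b), SCORES.get((b, a), MISMATCH))

def normalized_match_score(query, consensus):
    """Raw alignment score divided by the consensus self-score (assumed normalization).
    Consensus residues must appear on the diagonal of the toy table."""
    raw = sum(pair_score(q, c) for q, c in zip(query, consensus))
    self_score = sum(pair_score(c, c) for c in consensus)
    return raw / self_score

consensus = "WYEDE"  # hypothetical consensus signature residues, in position order
print(normalized_match_score("WYEDE", consensus))  # identical site
print(normalized_match_score("FFEDE", consensus))  # conservative substitutions
print(normalized_match_score("-QAKA", consensus))  # unrelated site
```

With such a normalization, conservative substitutions lower the score only moderately, while unrelated or missing residues drive it below zero, which matches the qualitative behavior of the scores in Table 4-5A.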

For GH16, because of the lack of a common substrate specificity mentioned above, multiple variations in the important active site residues have been observed. Thus, a 1,3-1,4-beta-D-glucanase from Fibrobacter succinogenes (PDB ID: 1mve) was assigned to be the mother protein that provides the active site template. The important positions in this protein (shaded in blue in Table 4-2 or Table 4-4) play the role that the consensus signatures play for the other subgroups. The normalized match scores of the proteins in the GH16 subgroup are generally much lower than those of any of the other subgroups, ranging from 0.19 to 0.79. However, normalized match scores for the proteins belonging to this subgroup are still significantly higher than those of proteins not belonging to GH16. Therefore, when a protein is scored against the consensus signature of a certain subgroup, the normalized match score can be used as an indicator of whether the protein can be classified into that subgroup. A positive normalized match score suggests that there is a correlation between the local active site of the protein and the consensus signature of the subgroup, and indicates that the protein has a biochemical function similar to that of the subgroup. A higher normalized match score indicates a higher similarity between the active site residues of a query protein and the consensus signature of a functional subgroup. A negative normalized match score suggests that there is no correlation between the active site of the protein and that of the functional subgroup.
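The decision rule described above (assign the subgroup with the highest score, provided that score is positive) can be written as a short Python sketch. The scores are copied from three rows of Table 4-5A; the function name best_subgroup and the strict "greater than zero" threshold are illustrative assumptions.

```python
SUBGROUPS = ["GH11", "GH12", "GH7", "GH16", "lyase", "peptidase"]

# Normalized match scores (BLOSUM62) from Table 4-5A, in SUBGROUPS order
TABLE_4_5A = {
    "2nlr": [0.02, 0.82, -0.14, -0.17, -0.11, -0.15],
    "3rq0": [-0.08, -0.15, -0.04, 0.15, -0.16, -0.23],
    "3osd": [-0.17, -0.30, -0.22, 0.00, -0.14, -0.40],
}

def best_subgroup(scores):
    """Return the top-scoring subgroup, or None if no score is positive."""
    best_score, best_name = max(zip(scores, SUBGROUPS))
    return best_name if best_score > 0 else None

for pdb_id, scores in TABLE_4_5A.items():
    print(pdb_id, best_subgroup(scores))
```

Applied to these rows, the rule places 2nlr in GH12 and 3rq0 in GH16, and leaves 3osd unassigned because none of its scores is positive.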

As demonstrated above, the SALSA method has successfully sorted the proteins in the superfamily into seven subgroups and efficiently quantifies the difference, or distance, between the active sites of proteins using a scoring matrix. The consensus signatures for each subgroup were established for the analysis of structural genomics (SG) proteins of unknown function. One of the most important applications of the SALSA method is the functional annotation of SG proteins. In the whole workflow, only the 3D structure of a query protein is required as input, which is an advantage in the analysis of SG proteins. Note that the performance of any functional residue prediction method that requires a sequence alignment is expected to degrade when applied to SG proteins, because the relationships between sequence and function are better established for the well-characterized proteins in the test set and because SG proteins tend to have more distant sequences. THEMATICS and POOL utilize only the 3D structure of the query protein for active site prediction. They can return predictions of local active sites for novel folds and engineered structures, as well as for proteins with few or no sequence homologues. These predictions of active site residues should be as reliable for these difficult cases as they are for the well-characterized proteins in the benchmark/testing sets. The SALSA method requires accurate prediction of the functionally important residues, and the POOL predictions therefore provide some advantage for the SALSA method.

An SG protein from Mycobacterium smegmatis str. MC2 155 (PDB ID: 3rq0) was annotated as a putative glycoside hydrolase-16. The aligned local active site residues of this protein were scored against all the consensus signatures of the seven known functional subgroups in the superfamily, and the results are shown in Table 4-5A. The highest normalized match score, 0.15, was obtained when it was scored against the predicted residues of the known member of GH16, a 1,3-1,4-beta-D-glucanase from Fibrobacter succinogenes (PDB ID: 1mve). GH16 has more variations in the consensus signature positions (shaded in blue for GH16 in Table 4-4) than the other subgroups. This SG protein was therefore further aligned and scored against other representative proteins in the GH16 subgroup. The SG protein has a positive normalized match score (ranging from 0.15 to 0.45) with all of the previously characterized GH16 protein structures, including 1,3-1,4-beta-D-glucan 4-glucanohydrolase from Hybrid Bacillus (PDB ID: 2ayh), porphyranase A from Zobellia galactanivorans (PDB ID: 3ilf), kappa-carrageenase from Pseudoalteromonas carrageenovora (PDB ID: 1dyp), endo-beta-1,3-glucanase from Pyrococcus furiosus (PDB ID: 2vy0), and 1,3-1,4-beta-D-glucanase from Fibrobacter succinogenes (PDB ID: 1mve). All of these previously characterized GH16 proteins have a sequence identity of no more than 36% to this SG protein. The sequence identity of this SG protein to the 1,3-1,4-beta-D-glucanase from Fibrobacter succinogenes (PDB ID: 1mve) is especially low, at only 4%, with a global RMSD of 4.71 Å. A global sequence- or structure-based annotation method might therefore lead to the conclusion that the SG protein does not belong to the same subgroup as 1mve. Yet the predicted active site residues of the SG protein are positioned in space in a nearly identical arrangement to the consensus signatures of GH16. For example, the catalytic residues (Glu-Asp-Glu) overlap perfectly with those of the known GH16 proteins. Polar residues (T167 in 3rq0) and aromatic residues (W191, H169 and W142) also overlap with those of the known GH16 proteins, although some residues differ, possibly because the SG protein interacts specifically with a different sugar molecule (see Figure 4-5). Therefore, because of the match at the catalytic site, SALSA predicts that the SG protein is plausibly a GH16 protein.

Four more SG proteins in the concanavalin A-like lectins/glucanases superfamily are listed in Table 4-1 and Table 4-5, for instance the putative glycoside hydrolases from Bacteroides thetaiotaomicron and Parabacteroides distasonis (PDB IDs: 3h3l and 3osd, respectively). A sequence similarity search was performed for the putative glycoside hydrolase from Parabacteroides distasonis (PDB ID: 3osd). The top five hits were all SG proteins of unknown or uncertain function, for instance the putative glycoside hydrolase from Bacteroides thetaiotaomicron (PDB ID: 3h3l), with a sequence identity of 58%, and another SG protein annotated as a glycosyl hydrolase from Bacteroides thetaiotaomicron (PDB ID: 3hbk), with a sequence similarity of 39%. The highest-ranking hit with known biochemical function is the beta-glucanase from Hybrid Bacillus (PDB ID: 2ayh); it has only 13% sequence identity with the query protein (PDB ID: 3osd). A structural similarity search on the same protein was performed using Dali22. Again, the top hits (with a Z score above 15) were all SG proteins. Therefore, simple transfer of function based on sequence or structural similarity does not provide much functional information.

These four SG proteins were aligned and scored against the consensus signatures of all of the GH subgroups. The highest normalized match score is zero, obtained when the putative glycoside hydrolase from Bacteroides thetaiotaomicron (PDB ID: 3osd) is scored against GH16. None of the other normalized match scores obtained when these four SG proteins are scored against the consensus signatures of the previously characterized GH subgroups is positive (see Table 4-5), suggesting that they are not likely to be members of any of the previously known glycoside hydrolase subgroups in the superfamily.

A structural inspection of these local active sites discloses that the key catalytic residues of GH16 and GH7 consist of Glu-Asp-Glu motifs located on a single β sheet, but there are no three adjacent Glu/Asp residues in the SG proteins. A superimposition of the key predicted active site residues of the SG protein (PDB ID: 3osd) onto those of the beta-glucanase (a GH16 protein, PDB ID: 2ayh) shows that the SG proteins are unlikely to be members of GH16, which is consistent with the normalized match scores mentioned above.

Table 4-4. (A) Predicted functionally important active site residues from POOL for five well-studied glycoside hydrolases and one SG protein from Mycobacterium smegmatis str. MC2 155 (PDB ID: 3rq0), structurally aligned to each other. Each column represents a structurally distinct position in the fold. The residues shaded in blue represent the ligand contact residues reported in the literature. The boldface residues are the proposed consensus signatures. (B) The sequence identity between the SG protein and the five GH16 proteins. (C) The PM values for the SG protein and the five GH16 proteins.

A

PDB (color)      Active site alignment
2ayh (gray)      N26   E63   V88   S90   F92   Y94   W103  E105  D107  E109  L111  Q119  N121  Y123  E131  Y147  K178  N182  G186
1dyp (cyan)      -     V114  V139  A142  W144  Y146  Y161  E163  D165  E168  -     D179  D181  H183  H205  Y221  N252  S256  R260
3ilf (orange)    W56   A102  S128  F130  W131  R133  F137  E139  D141  E144  -     H163  N165  H167  G179  Y195  R228  D232  F236
2vy0 (magenta)   G71   R122  W150  A152  W154  L156  C168  E170  D172  E175  L177  H185  H189  Y193  T200  F219  Y259  N263  G267
1mve (green)     -     E4    -     S31   F40   Y42   W54   E56   D58   E60   -     Q70   N72   I74   E85   Y101  H99   N139  S143
3rq0 (SG, navy)  E64   K111  W128  A140  W142  G144  E148  E150  D152  E155  Y157  A165  T167  H169  T179  W191  F234  N238  G243

B

            2ayh   1dyp   3ilf   2vy0   1mve   3rq0
2ayh        1.00   0.31   0.23   0.30   0.11   0.22
1dyp        0.31   1.00   0.31   0.37   0.04   0.30
3ilf        0.23   0.31   1.00   0.30   0.07   0.25
2vy0        0.30   0.37   0.30   1.00   0.11   0.36
1mve        0.11   0.04   0.07   0.11   1.00   0.04
3rq0 (SG)   0.22   0.30   0.25   0.36   0.04   1.00

C

            2ayh   1dyp   3ilf   2vy0   1mve   3rq0
2ayh        1.00   0.55   0.48   0.39   0.88   0.22
1dyp        0.55   1.00   0.63   0.40   0.46   0.43
3ilf        0.48   0.63   1.00   0.55   0.39   0.45
2vy0        0.39   0.40   0.55   1.00   0.25   0.26
1mve        0.88   0.46   0.39   0.25   1.00   0.15
3rq0 (SG)   0.22   0.43   0.45   0.26   0.15   1.00

Figure 4-5. Structural alignment of the predicted active site residues of an SG protein from Mycobacterium smegmatis str. MC2 155 (PDB ID: 3rq0) with the active site residues of five known GH16 proteins. The local alignment of the active site residues of the SG protein confirms that it has been correctly annotated as GH16. Residues from different proteins are color coded according to the legend (PDB ID, color): 2ayh, gray; 1dyp, cyan; 3ilf, orange; 2vy0, magenta; 1mve, green; 3rq0 (SG), navy.

Table 4-5. (A) Normalized match scores of proteins in the concanavalin A-like lectin/ glucanases superfamily scored against the consensus signatures for each subgroup using BLOSUM62. Known members of each subgroup are shown in colored highlighting.

PDB ID   GH11    GH12    GH7     GH16    lyase   peptidase
1m4w      1.00   -0.07   -0.27   -0.09    0.06   -0.35
1h4g      0.97   -0.04   -0.28   -0.13    0.15   -0.25
1bcx      1.00   -0.06   -0.31   -0.09    0.06   -0.35
1uu4     -0.02    1.00   -0.06   -0.18   -0.02   -0.26
1h8v     -0.08    1.00   -0.11   -0.18   -0.10   -0.26
2nlr      0.02    0.82   -0.14   -0.17   -0.11   -0.15
1z3t     -0.14   -0.08    1.00    0.00   -0.12   -0.28
1dy4     -0.13   -0.06    1.00    0.08   -0.08   -0.24
2rfw     -0.12   -0.06    0.90    0.08   -0.10   -0.27
2ayh     -0.21   -0.31    0.01    0.79   -0.14   -0.27
1dyp     -0.10   -0.21    0.04    0.43   -0.14    0.03
3ilf     -0.22   -0.21   -0.03    0.36   -0.09   -0.11
2vy0     -0.28   -0.36   -0.03    0.19   -0.24   -0.12
1mve     -0.16   -0.36   -0.09    1.00   -0.13   -0.23
1uai     -0.08   -0.24   -0.19   -0.14    1.00   -0.18
1j1t     -0.08   -0.15   -0.25   -0.20    0.94   -0.36
1vav     -0.07   -0.22   -0.28   -0.29    0.94   -0.29
2ifr     -0.25   -0.18   -0.18   -0.27   -0.25    1.00
1y43     -0.27   -0.22   -0.15   -0.24   -0.23    0.96
SG proteins
3rq0     -0.08   -0.15   -0.04    0.15   -0.16   -0.23
3osd     -0.17   -0.30   -0.22    0.00   -0.14   -0.40
3nmb     -0.23   -0.33   -0.22   -0.03   -0.14   -0.39
3h3l     -0.19   -0.32   -0.32    0.00   -0.12   -0.39
3hbk     -0.23   -0.32   -0.26   -0.05   -0.16   -0.27

(B) Sequence identity scores for proteins in the concanavalin A-like lectin/glucanases superfamily scored against representative proteins from each of the functional subgroups. Known members of each subgroup are shown in colored highlighting.

PDB ID   GH11    GH12    GH7     GH16    lyase   peptidase
1m4w      1.00    0.09    0.08    0.11    0.03    0.02
1h4g      0.52    0.10    0.10    0.06    0.06    0.03
1bcx      0.63    0.10    0.11    0.07    0.05    0.06
1uu4      0.09    1.00    0.08    0.06    0.07    0.11
1h8v      0.15    0.44    0.08    0.05    0.03    0.04
2nlr      0.15    0.15    0.09    0.07    0.05    0.06
1z3t      0.08    0.08    1.00    0.13    0.02    0.16
1dy4      0.06    0.08    0.56    0.08    0.02    0.14
2rfw      0.08    0.10    0.49    0.04    0.05    0.08
2ayh      0.11    0.06    0.13    0.25    0.12    0.08
1dyp      0.09    0.06    0.05    0.14    0.08    0.02
3ilf      0.04    0.03    0.02    0.10    0.05    0.08
2vy0      0.03    0.07    0.04    0.14    0.05    0.04
1mve      0.12    0.08    0.03    1.00    0.02    0.12
1uai      0.03    0.07    0.02    0.12    1.00    0.03
1j1t      0.05    0.01    0.05    0.05    0.10    0.09
1vav      0.07    0.04    0.04    0.05    0.19    0.06
2ifr      0.02    0.11    0.16    0.08    0.03    1.00
1y43      0.06    0.04    0.15    0.07    0.06    0.52
SG proteins
3rq0      0.10    0.09    0.10    0.07    0.10    0.07
3osd      0.09    0.02    0.04    0.09    0.06    0.07
3nmb      0.05    0.05    0.08    0.11    0.12    0.07
3h3l      0.06    0.06    0.07    0.12    0.12    0.07
3hbk      0.04    0.06    0.12    0.10    0.13    0.07

Figure 4-6. POOL-predicted active site residues shown explicitly as sticks for the putative glycoside hydrolases with PDB IDs (A) 3osd; (B) 3nmb; (C) 3hbk; and (D) 3h3l. Note that one SG protein from Bacteroides thetaiotaomicron (PDB ID: 3osd) shares some similarity in its active site residues with two other SG proteins (PDB IDs: 3nmb and 3hbk).

Further investigation of the local active sites of all known GH subgroups in the concanavalin A-like lectins/glucanases superfamily revealed that for GH11 and GH12, the two catalytic residues (typically two glutamates: a proton donor and a nucleophilic residue) are arranged along a roughly 6 Å diagonal line across two or three β-sheets, while the catalytic (Glu/Asp) residues in the GH16 and GH7 subgroups are located on a straight line on a single β sheet, around 5-6 Å apart. Based on the images in Figure 4-6 of the functionally important residues predicted by POOL for these SG proteins, the predicted active sites do not match those of any of the previously characterized glycoside hydrolases. For example, in 3osd (Figure 4-6A), several Glu/Asp residues have been predicted to be functionally important. However, no pair of Glu/Asp residues is observed at the 5-6 Å distance from each other that is presumably necessary for catalyzing a hydrolysis reaction. The distances between Glu164 and Asp191, Asp175, and Asp111 in 3osd are around 9, 10, and 9 Å, respectively.
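The search for a candidate catalytic pair can be expressed as a scan over all pairs of predicted Glu/Asp residues. The Python sketch below assumes one representative coordinate per residue (for example, a carboxylate carbon) and a 5-6 Å window; the function name and all coordinates are hypothetical placeholders, chosen only so that the distances from Glu164 reproduce the approximate 9, 10, and 9 Å values quoted above.

```python
import math
from itertools import combinations

def acidic_pairs_in_window(residues, low=5.0, high=6.0):
    """Return (name_a, name_b, distance) for Glu/Asp pairs whose
    separation in Å falls inside [low, high]."""
    acidic = [(name, xyz) for name, xyz in residues.items()
              if name.startswith(("Glu", "Asp"))]
    hits = []
    for (name_a, xyz_a), (name_b, xyz_b) in combinations(acidic, 2):
        distance = math.dist(xyz_a, xyz_b)
        if low <= distance <= high:
            hits.append((name_a, name_b, round(distance, 1)))
    return hits

# Hypothetical coordinates (Å) for POOL-predicted residues in a 3osd-like site
site_3osd_like = {
    "Glu164": (0.0, 0.0, 0.0),
    "Asp191": (9.0, 0.0, 0.0),
    "Asp175": (0.0, 10.0, 0.0),
    "Asp111": (0.0, 0.0, 9.0),
    "Gln273": (4.5, 0.0, 0.0),  # not acidic, so ignored by the scan
}
# Hypothetical pair with a retaining-type separation, as a positive control
control_site = {"Glu56": (0.0, 0.0, 0.0), "Glu60": (5.5, 0.0, 0.0)}

print(acidic_pairs_in_window(site_3osd_like))
print(acidic_pairs_in_window(control_site))
```

With these placeholder coordinates, the scan finds no acidic pair in the window for the 3osd-like site and one pair for the control site.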

Figure 4-6A, B and C show that the POOL-predicted active sites of 3osd, 3nmb and 3hbk are very similar. The involvement of Arg, His, Asp and Lys in the active sites is reminiscent of the consensus signature of the alginate lyase subgroup. The superimposition/alignment of the active site residues of these three SG proteins onto those of the lyases shows some degree of similarity in the catalytic residues. Specifically, in the putative glycoside hydrolase from Parabacteroides distasonis ATCC 8503 (PDB ID: 3hbk), the predicted important residues Lys229, Tyr235 and His254 are superimposable on three of the four catalytic residues of the lyases (e.g. Arg59, Tyr199, and His104 in the 1vav structure). An additional residue, Gln252 in 3hbk, is 3-4 Å away from the nearest catalytic residue (Gln102 in 1vav). This variation weakens the case that this SG protein (PDB ID: 3hbk) is a lyase. Another interesting observation is the pair of Glu-Gln residues located 4.5 Å away from each other in these three SG proteins (Glu164-Gln273 in 3osd, Glu162-Gln271 in 3nmb, and Glu145-Gln252 in 3hbk), which is identical to the Glu-Gln catalytic dyad of the glutamic peptidase subgroup. This indicates that these SG proteins might use a catalytic mechanism similar to that of the peptidase subgroup for the degradation of biopolymers. However, because the other active site residues differ from the ligand contact residues of the known glutamic peptidases, the specificities of the SG proteins may be quite different from those of the known glutamic peptidases in the superfamily. Nevertheless, the SALSA method can provide detailed insight into the important active site residues of SG proteins and disclose the similarity of these active site residues to the catalytic residues of known members of a subgroup.

Figure 4-6D shows the important active site residues reported by POOL for the last SG protein, a putative glycoside hydrolase from Parabacteroides distasonis ATCC 8503 (PDB ID: 3h3l). A detailed analysis of the important residues reveals that three predicted residues (Glu232, Tyr168 and Asp153) in this SG protein align with the catalytic motif of GH12/GH11 (e.g. Glu176, Tyr89 and Glu87 in 1m4w, a GH11 protein structure). Glu176 and Glu87 are the catalytic residues in the known GH11 protein. This suggests that Glu232 and Asp153 in the SG protein possibly also work as catalytic residues, since they overlap spatially with the catalytic residues of the GH11 protein (see Figure 4-7).

However, there are some notable differences. (1) The catalytic residues in the known GH11 protein (PDB ID: 1m4w) are located in the middle of the binding cleft, while Glu232 and Asp153 in the SG protein are located at one side entrance to the binding cleft. (2) The other active site residues in this Parabacteroides distasonis ATCC 8503 SG protein (PDB ID: 3h3l) are significantly different from the aligned residues in the beta-1,4-xylanase from Nonomuraea flexuosa (PDB ID: 1m4w), which is known to belong to GH11. (3) Three residues on the opposite side of the binding cleft in the SG protein, Asp98, His256, and Asp255, resemble a metal binding motif. All in all, these observations suggest that the 3h3l structure may be a new type of glycoside hydrolase with a substrate specificity different from those of the known GH proteins in the concanavalin A-like lectins/glucanases superfamily.

In sum, the very poor match between the structurally aligned residues of the four putative GH proteins and the consensus signatures of the five known GH subgroups indicates that they do not share function with any known GH proteins and that they are possibly members of new functional subgroups in the superfamily.

4.3.5 The limitations of the method

In this chapter, we focused only on the analysis of the enzymatic proteins in the concanavalin A-like lectins/glucanases superfamily, i.e. the proteins with catalytic function. Proteins that work as sugar-binding modules, with similar concanavalin A-like lectin/glucanase structures but without any enzymatic function, are beyond the scope of this thesis. The POOL method is a machine learning method that was trained on the CSA-100, a set of 100 enzymes with literature references confirming the role of specific residues in catalysis. The proteins in the CSA-100 are exclusively enzymes. Therefore, the SALSA method, as developed so far, has not been optimized to analyze non-enzymatic proteins. In the future, it may be possible to train a machine learning method on a set of sugar-binding proteins with non-enzymatic functions, in order to discriminate between enzymatic and non-enzymatic proteins and to predict the sugar-binding residues of the non-enzymatic proteins. The prediction of the functions of non-enzymes would then be possible in principle, using a SALSA-type approach of local structural alignment of the predicted interaction sites.


Figure 4-7. A GH12 protein (PDB: 1m4w) from Nonomuraea flexuosa compared to an SG protein from Parabacteroides distasonis ATCC 8503. POOL-predicted active site residues are shown explicitly as sticks. The active site residues in the GH12 protein are colored grey, while the active site residues in the SG protein are colored blue.


4.4 Conclusions

While global structure and protein sequence comparisons can sometimes be valuable tools for evaluating protein biochemical function, the present method provides unique insight for studying and predicting the biochemical function of a protein using its local active site residues. This method requires only the 3D structure of the query protein as input. Even when there is no detectable domain or fold similarity, the present methodology is still capable in principle of predicting functional information from the local active sites.

The SALSA method has been successful in sorting the members of known function into seven functional subgroups in the concanavalin A-like lectins/glucanases superfamily. The generation of a consensus signature for each of the functional subgroups in the superfamily is useful in screening the protein structures of unknown function in the superfamily. This chapter reports one SG protein to be correctly annotated and four other SG proteins that are likely to belong to new functional subgroups. So far, the SALSA method is able to provide detailed information about the protein active sites and the catalytic mechanism. An automated version of the methodology for high throughput screening of proteins of unknown function is being implemented.


4.5 References

1. Rose, P. W.; Beran, B.; Bi, C.; Bluhm, W. F.; Dimitropoulos, D.; Goodsell, D. S.; Prlic, A.; Quesada, M.; Quinn, G. B.; Westbrook, J. D.; Young, J.; Yukich, B.; Zardecki, C.; Berman, H. M.; Bourne, P. E., The RCSB Protein Data Bank: redesigned web site and web services. Nucleic Acids Research 2011, 39, D392-D401.

2. (a) Berman, H. M.; Bhat, T. N.; Bourne, P. E.; Feng, Z. K.; Gilliland, G.; Weissig, H.; Westbrook, J., The Protein Data Bank and the challenge of structural genomics. Nature Structural Biology 2000, 7, 957-959; (b) Berman, H. M.; Westbrook, J. D., The impact of structural genomics on the protein data bank. American journal of pharmacogenomics : genomics-related research in drug development and clinical practice 2004, 4 (4), 247-52; (c) Westbrook, J.; Feng, Z. K.; Chen, L.; Yang, H. W.; Berman, H. M., The Protein Data Bank and structural genomics. Nucleic Acids Research 2003, 31 (1), 489-491.

3. (a) Erdin, S.; Lisewski, A. M.; Lichtarge, O., Protein function prediction: towards integration of similarity metrics. Curr. Opin. Struct. Biol. 2011, 21 (2), 180-188; (b) Tian, W. D.; Skolnick, J., How well is enzyme function conserved as a function of pairwise sequence identity? Journal of Molecular Biology 2003, 333 (4), 863-882.

4. (a) Schnoes, A. M.; Brown, S. D.; Dodevski, I.; Babbitt, P. C., Annotation Error in Public Databases: Misannotation of Molecular Function in Enzyme Superfamilies. Plos Computational Biology 2009, 5 (12); (b) Brown, S. D.; Gerlt, J. A.; Seffernick, J. L.; Babbitt, P. C., A gold standard set of mechanistically diverse enzyme superfamilies. Genome Biology 2006, 7 (1).

5. Wilson, D.; Madera, M.; Vogel, C.; Chothia, C.; Gough, J., The SUPERFAMILY database in 2007: families and functions. Nucleic Acids Res. 2007, 35 (Database issue), D308-13. Epub 2006 Nov 10.

6. Babbitt, P. C.; Hasson, M. S.; Wedekind, J. E.; Palmer, D. R. J.; Barrett, W. C.; Reed, G. H.; Rayment, I.; Ringe, D.; Kenyon, G. L.; Gerlt, J. A., The Enolase Superfamily: A General Strategy for Enzyme-Catalyzed Abstraction of the α-Protons of Carboxylic Acids. Biochemistry 1996, 35 (51), 16489-16501.


7. Wei, Y.; Ringe, D.; Wilson, M. A.; Ondrechen, M. J., Identification of functional subclasses in the DJ-1 superfamily proteins. Plos Computational Biology 2007, 3 (1), 120-126.

8. Naumoff, D. G., Hierarchical classification of glycoside hydrolases. Biochemistry (Mosc). 2011, 76 (6), 622-35.

9. Ashkenazy, H.; Erez, E.; Martz, E.; Pupko, T.; Ben-Tal, N., ConSurf 2010: calculating evolutionary conservation in sequence and structure of proteins and nucleic acids. Nucleic Acids Research 2010, 38, W529-W533.

10. Lichtarge, O.; Bourne, H. R.; Cohen, F. E., An evolutionary trace method defines binding surfaces common to protein families. Journal of Molecular Biology 1996, 257 (2), 342-358.

11. (a) Sankararaman, S.; Sha, F.; Kirsch, J. F.; Jordan, M. I.; Sjolander, K., Active site prediction using evolutionary and structural information. Bioinformatics 2010, 26 (5), 617-624; (b) Sankararaman, S.; Sjoelander, K., INTREPID-INformation-theoretic TREe traversal for Protein functional site IDentification. Bioinformatics 2008, 24 (21), 2445-2452.

12. Hendlich, M.; Rippmann, F.; Barnickel, G., LIGSITE: Automatic and efficient detection of potential small molecule-binding sites in proteins. Journal of Molecular Graphics & Modelling 1997, 15 (6), 359-+.

13. Laskowski, R. A., SURFNET - A Program for visualizing molecular-surfaces, cavities, and intermolecular interactions. Journal of Molecular Graphics 1995, 13 (5), 323-&.

14. Dundas, J.; Ouyang, Z.; Tseng, J.; Binkowski, A.; Turpaz, Y.; Liang, J., CASTp: computed atlas of surface topography of proteins with structural and topographical mapping of functionally annotated residues. Nucleic Acids Research 2006, 34, W116-W118.

15. Capra, J. A.; Laskowski, R. A.; Thornton, J. M.; Singh, M.; Funkhouser, T. A., Predicting Protein Ligand Binding Sites by Combining Evolutionary Sequence Conservation and 3D Structure. Plos Computational Biology 2009, 5 (12).

16. Laurie, A. T. R.; Jackson, R. M., Q-SiteFinder: an energy-based method for the prediction of protein-ligand binding sites. Bioinformatics 2005, 21 (9), 1908-1916.


17. (a) Murga, L. F.; Ko, J.; Wei, Y.; Ondrechen, M. J., Central moments based statistical analysis for the determination of functional sites in proteins with thematics. 2006; p 215-224; (b) Wei, Y.; Ko, J.; Murga, L. F.; Ondrechen, M. J., Selective prediction of interaction sites in protein structures with THEMATICS. Bmc Bioinformatics 2007, 8.

18. Tong, W. X.; Wei, Y.; Murga, L. F.; Ondrechen, M. J.; Williams, R. J., Partial Order Optimum Likelihood (POOL): Maximum Likelihood Prediction of Protein Active Site Residues Using 3D Structure and Sequence Properties. Plos Computational Biology 2009, 5 (1).

19. Somarowthu, S.; Yang, H. Y.; Hildebrand, D. G. C.; Ondrechen, M. J., High-Performance Prediction of Functional Residues in Proteins with Machine Learning and Computed Input Features. Biopolymers 2011, 95 (6), 390-400.

20. (a) Daiyasu, H.; Saino, H.; Tomoto, H.; Mizutani, M.; Sakata, K.; Toh, H., Computational and Experimental Analyses of Furcatin Hydrolase for Substrate Specificity Studies of Disaccharide-specific Glycosidases. Journal of Biochemistry 2008, 144 (4), 467-475; (b) Hakulinen, N.; Turunen, O.; Jänis, J.; Leisola, M.; Rouvinen, J., Three-dimensional structures of thermophilic β-1,4-xylanases from Chaetomium thermophilum and Nonomuraea flexuosa. European Journal of Biochemistry 2003, 270 (7), 1399-1412.

21. Bhaduri, A.; Pugalenthi, G.; Sowdhamini, R., PASS2: an automated database of protein alignments organised as structural superfamilies. Bmc Bioinformatics 2004, 5.

22. Holm, L.; Rosenstrom, P., Dali server: conservation mapping in 3D. Nucleic Acids Research 2010, 38, W545-W549.

23. Krieger, E.; Koraimann, G.; Vriend, G., Increasing the precision of comparative models with YASARA NOVA - a self-parameterizing force field. Proteins-Structure Function and Genetics 2002, 47 (3), 393-402.

24. Gaertner, A.; Schwientek, P.; Ellinghaus, P.; Summer, H.; Golz, S.; Kassner, A.; Schulz, U.; Gummert, J.; Milting, H., Myocardial transcriptome analysis of human arrhythmogenic right ventricular cardiomyopathy. Physiological Genomics 2012, 44 (1), 99-109.


25. UniProt, C., The Universal Protein Resource (UniProt) in 2010. Nucleic Acids Research 2010, 38 (Database issue), D142-8.

26. Arnold, K.; Kiefer, F.; Kopp, J.; Battey, J. N. D.; Podvinec, M.; Westbrook, J. D.; Berman, H. M.; Bordoli, L.; Schwede, T., The Protein Model Portal. Journal of Structural and Functional Genomics 2009, 10 (1), 1-8.

27. Ko, J.; Murga, L. F.; Wei, Y.; Ondrechen, M. J., Prediction of active sites for protein structures from computed chemical properties. Bioinformatics 21 (suppl 1), i258-i265.

28. Di Tommaso, P.; Moretti, S.; Xenarios, I.; Orobitg, M.; Montanyola, A.; Chang, J.-M.; Taly, J.-F.; Notredame, C., T-Coffee: a web server for the multiple sequence alignment of protein and RNA sequences using structural information and homology extension. Nucleic Acids Research 2011, 39, W13-W17.

29. Shindyalov, I. N.; Bourne, P. E., Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Engineering 1998, 11 (9), 739-747.

30. Somarowthu, S.; Yang, H.; Hildebrand, D. G. C.; Ondrechen, M. J., High-performance prediction of functional residues in proteins with machine learning and computed input features. Biopolymers 2011, n/a-n/a.

31. Porter, C. T.; Bartlett, G. J.; Thornton, J. M., The Catalytic Site Atlas: a resource of catalytic sites and residues identified in enzymes using structural data. Nucleic Acids Research 2004, 32, D129-D133.

32. Eddy, S. R., Where did the BLOSUM62 alignment score matrix come from? Nature Biotechnology 2004, 22 (8), 1035-1036.

33. Naumoff, D. G., Hierarchical classification of glycoside hydrolases. Biochem.-Moscow 2011, 76 (6), 622-635.

34. Cantarel, B. L.; Coutinho, P. M.; Rancurel, C.; Bernard, T.; Lombard, V.; Henrissat, B., The Carbohydrate-Active EnZymes database (CAZy): an expert resource for Glycogenomics. Nucleic Acids Research 2009, 37, D233-D238.


35. Davies, G.; Henrissat, B., Structures and mechanisms of glycosyl hydrolases. Structure 1995, 3 (9), 853-859.

36. Sandgren, M.; Berglund, G. I.; Shaw, A.; Ståhlberg, J.; Kenne, L.; Desmet, T.; Mitchinson, C., Crystal Complex Structures Reveal How Substrate is Bound in the −4 to the +2 Binding Sites of Humicola grisea Cel12A. Journal of Molecular Biology 2004, 342 (5), 1505-1517.

37. Lombard, V.; Bernard, T.; Rancurel, C.; Brumer, H.; Coutinho, P. M.; Henrissat, B., A hierarchical classification of polysaccharide lyases for glycogenomics. Biochemical Journal 2010, 432, 437-444.


CHAPTER 5 CONCLUSIONS AND FUTURE DIRECTIONS


5.1 Conclusions

Molecular modeling methods can accelerate and guide drug design and contribute to the understanding of the biochemical functions of gene products. Chapter 2 explored the application of computational tools to structure-based drug design for human African trypanosomiasis (HAT), a vector borne disease caused by several species of trypanosomes, affecting thousands of people every year. One validated target for HAT treatment is Trypanosoma brucei Aurora kinase-1 (TbAUK1). This protein is homologous to the human Aurora kinase, providing an opportunity to repurpose human Aurora kinase inhibitors for the development of TbAUK1 inhibitors. A comparative model structure of TbAUK1 has been built. The docking studies helped to design and prioritize inhibitors based on a series of analogs of the pyrrolopyrazole inhibitor danusertib, currently in clinical trials for cancer. The TbAUK1 model has provided further structure-based insights for the design of inhibitor affinity and selectivity. New inhibitors designed using the TbAUK1 homology model have shown sub-micromolar inhibition in the T. brucei proliferation assay with 25-fold selectivity over human cells.

Chapter 3 described the application of molecular modeling techniques to investigate other targets for trypanosomiasis treatment, i.e., Trypanosoma brucei phosphodiesterase B1 (TbrPDEB1) and Trypanosoma brucei phosphodiesterase B2 (TbrPDEB2). The homology models of TbrPDEB1 and TbrPDEB2 have been established and validated. The comparison of these two models has revealed that TbrPDEB1 and TbrPDEB2 share high similarity in their primary sequences, model structures, and local active sites. Thus, it would be feasible to develop compounds that can inhibit the two TbrPDEBs simultaneously. The TbrPDEB1 model has proved to be a useful tool for structure-based inhibitor design and helps to rationalize the interaction between TbrPDEB1 and catechol ether-containing compounds. Moreover, the docking studies together with the binding free energy calculations have revealed the important factors driving compound potency. The comparison between human PDE4 enzymes and TbrPDEB1 has shown differences in their active site residues and thus has provided the opportunity to develop inhibitors with specificity. In sum, the homology modeling and docking studies for TbrPDEB1 have provided valuable tools for rational inhibitor design.

Chapter 4 explored a computational method, Structurally Aligned Local Sites of Activity (SALSA), to bridge the gap between the 3D structures and the biochemical functions of proteins. This method facilitates the classification and identification of the function of proteins using the 3D structures of proteins as input. It utilizes two previously developed computational active site predictors, POOL and THEMATICS, together with a local structural alignment and a scoring matrix, to establish the consensus signatures of the different functional subgroups of a superfamily. As a proof of concept, the enzymes in the concanavalin A-like lectins/glucanases superfamily have been classified into subgroups according to their biochemical functions. The proteins in this superfamily have a similar fold, consisting of a sandwich of 12-14 antiparallel beta strands in two curved sheets. The enzymes in this superfamily have been successfully sorted into six subgroups with different biochemical functions, and information about the function of the SG proteins in this superfamily has also been provided. One SG protein is plausibly correctly annotated and four SG proteins are likely to belong to new, as yet unidentified functional subgroups.


5.2 Future directions

5.2.1. Drug discovery for human African trypanosomiasis (HAT)

It is noted that the water molecules in the binding pocket of TbrPDEB1/TbrPDEB2 may strongly affect ligand binding and the affinity of the small molecule, and these are prominent features in published crystallographic analyses.1-3 Replacing the localized water molecules in the binding pocket of the protein structure is one of the most important steps for a small molecule to gain its affinity. Understanding the positioning of these localized water molecules in the binding pocket would therefore be significant for ligand design. However, it is challenging to address the bound water molecules in a protein structure and even harder to address these water molecules in a homology model. Information from homologous crystal structures can provide some insight into how ligands and proteins interact through mediating water molecules. It would be most useful to obtain the crystal structure of the protein of interest. The crystal structure of a protein target can be used not only to compare and validate the homology models, but also to address the water molecules in the binding pocket explicitly with advanced computational techniques. In the cases of TbrPDEB1 and TbrPDEB2, we expect that structural water molecules play important roles. Thus, obtaining the crystal structures of TbrPDEB1 and TbrPDEB2 with the compounds of interest would be extremely useful and would enhance the reliability of molecular modeling for structure-based drug design.

It is also feasible to combine computational ligand-based drug design techniques with structure-based drug design once biochemical activity data become available for more compounds. Ligand-based design is grounded in the assumption that ligands similar to an active ligand are more likely to be active than other random ligands. Ligand-based approaches compare two- or three-dimensional chemistry, shape, electrostatic features, and interaction points (e.g., pharmacophore points) of ligands to assess their similarity and can be used as complementary tools for compound library screening.

It has been noted that drugs for HAT all need to permeate into the central nervous system (CNS), because in the fatal stage II of HAT disease the parasite invades the CNS. Thus, the physicochemical properties of TbAUK1 and TbrPDE inhibitors need to be aligned with several key druglike attributes of CNS-active drugs, including (1) lipophilicity, as the calculated partition coefficient (ClogP); (2) calculated distribution coefficient at pH 7.4 (ClogD); (3) molecular weight (MW); (4) topological polar surface area (TPSA); (5) number of hydrogen bond donors (HBD); and (6) most basic center (pKa).4 The computed properties of current lead molecules, two danusertib analogs that are likely inhibitors of TbAUK1 (NEU325 and NEU327), are nearly in the range of CNS-active drugs (as shown in Table 5-1). However, blood-brain barrier permeation may be an issue with the danusertib chemotype because the MW, ClogP, and HBD for these analogs are higher than those of CNS-active drugs, and thus further modifications are needed.


Table 5-1. Comparison of the computed properties of current lead molecules, two danusertib analogs that are likely inhibitors of TbAUK1 (NEU325 and NEU327), to the CNS-active drugs

Property        NEU-327    NEU-335    CNS MPO desirable6
MW              494.6      486.6      ≤ 360
ClogP           3.5        3.8        ≤ 3
ClogD           3.5        3.8        ≤ 2
HBD (OH+NH)     2          2          ≤ 0.5
TPSA            84.6       84.6       40 < x ≤ 90
pKa             0.12       0.12       ≤ 8
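The comparison in Table 5-1 can be written as a simple range check. The sketch below is illustrative only: it uses the values and limits as they appear in Table 5-1, the names `LEADS`, `DESIRABLE`, and `out_of_range` are hypothetical, and it applies a plain pass/fail test rather than the graded desirability functions of the CNS MPO approach.

```python
# Values transcribed from Table 5-1 (column headers as printed there).
LEADS = {
    "NEU-327": {"MW": 494.6, "ClogP": 3.5, "ClogD": 3.5, "HBD": 2, "TPSA": 84.6, "pKa": 0.12},
    "NEU-335": {"MW": 486.6, "ClogP": 3.8, "ClogD": 3.8, "HBD": 2, "TPSA": 84.6, "pKa": 0.12},
}

# Desirable ranges from Table 5-1 as (exclusive lower bound, inclusive upper bound).
# None means that no bound is listed in the table.
DESIRABLE = {
    "MW": (None, 360.0),
    "ClogP": (None, 3.0),
    "ClogD": (None, 2.0),
    "HBD": (None, 0.5),
    "TPSA": (40.0, 90.0),
    "pKa": (None, 8.0),
}


def out_of_range(properties):
    """Return the names of the properties that fall outside the desirable range."""
    flagged = []
    for name, value in properties.items():
        low, high = DESIRABLE[name]
        too_low = low is not None and value <= low
        too_high = high is not None and value > high
        if too_low or too_high:
            flagged.append(name)
    return flagged


for compound, properties in LEADS.items():
    print(compound, out_of_range(properties))
```

With the tabulated values, this check flags MW, ClogP, ClogD, and HBD for both compounds, while TPSA and pKa fall within the listed ranges.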

5.2.2. Structurally Aligned Local Sites of Activity (SALSAs) for Functional Characterization of Enzyme Structures in the Concanavalin A-like Lectins/Glucanases Superfamily

In this thesis, we have defined the normalized match score to quantify the difference between the protein pairs. To improve the quantification of the distance/difference among protein active sites, a more sophisticated scoring method for 3D local active site comparison should be developed. For instance, a scoring method could provide different weights to the substrate recognition residues and the catalytic residues in the local active site alignment.
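One possible form of such a role-weighted score is sketched below. This is not the normalized match score used in this thesis: the role labels, the weights, and the identity-based `residue_similarity` function are all assumptions made for illustration, and a substitution matrix could be used in place of the identity test.

```python
from typing import List, Optional, Tuple

# One aligned position: (residue in protein A, residue in protein B, role).
# None marks a gap in the local alignment. Roles and weights are hypothetical.
AlignedSite = Tuple[Optional[str], Optional[str], str]

ROLE_WEIGHTS = {"catalytic": 2.0, "recognition": 1.0}


def residue_similarity(a: Optional[str], b: Optional[str]) -> float:
    """Toy similarity: 1 for identical residues, 0 for mismatches and gaps."""
    if a is None or b is None:
        return 0.0
    return 1.0 if a == b else 0.0


def weighted_match_score(sites: List[AlignedSite]) -> float:
    """Role-weighted fraction of matching positions, between 0 and 1."""
    total = sum(ROLE_WEIGHTS[role] for _, _, role in sites)
    if total == 0:
        return 0.0
    matched = sum(
        ROLE_WEIGHTS[role] * residue_similarity(a, b) for a, b, role in sites
    )
    return matched / total


# Hypothetical aligned active site: two catalytic matches, one recognition
# mismatch, and one recognition position aligned to a gap.
example = [
    ("E", "E", "catalytic"),
    ("D", "D", "catalytic"),
    ("W", "Y", "recognition"),
    ("N", None, "recognition"),
]
print(round(weighted_match_score(example), 3))
```

In this sketch, a mismatch at a catalytic position lowers the score twice as much as a mismatch at a recognition position; the relative weights would need to be calibrated against proteins of known function.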

In addition, our local active site alignment technique is mainly focused on sets of structures within a superfamily, since we take advantage of global structural alignment techniques, such as CE, T-Coffee, PDBeFold, and/or TOPOFIT, to obtain a coarse structural alignment before carrying out the local active site alignment. This approach would not be suitable for cross-fold protein comparison (i.e., for proteins with different 3D folds) or for a large number of protein structures as input. A method, such as a pattern recognition algorithm, to achieve high throughput alignment of the local active site 3D motif would be highly desirable. Ideally, this new method should be expected to match around eight to ten active site residues predicted by POOL.

5.3 References

1. (a) Hendrix, M.; Kallus, C., Phosphodiesterase Inhibitors: A Chemogenomic View. In Chemogenomics in Drug Discovery, Wiley-VCH Verlag GmbH & Co. KGaA: 2005; pp 243-288; (b) Beuming, T.; Farid, R.; Sherman, W., High-energy water sites determine peptide binding affinity and specificity of PDZ domains. Protein Science 2009, 18 (8), 1609-1619; (c) Mancera, R. L., Molecular modeling of hydration in drug design. Current Opinion in Drug Discovery & Development 2007, 10 (3), 275-280.

2. Wager, T. T.; Hou, X.; Verhoest, P. R.; Villalobos, A., Moving beyond Rules: The Development of a Central Nervous System Multiparameter Optimization (CNS MPO) Approach To Enable Alignment of Druglike Properties. ACS Chemical Neuroscience 2010, 1 (6), 435-449.


APPENDIX DESULFURIZATION OF CYSTEINE-CONTAINING PEPTIDES RESULTING FROM SAMPLE PREPARATION FOR PROTEIN ANALYSIS

Article published in Rapid Communications in Mass Spectrometry 2010, 24 (3), 267-275.


INTRODUCTION

Cysteine residues have highly reactive thiol groups and are thus prone to modifications, including oxidation1, nitrosylation2, or glutathionylation3. For example, disulfide bond formation due to the oxidation of pairs of cysteines is critical for three-dimensional folding4-7. Additionally, disulfide heterogeneity, including formation of non-native disulfide bonds from disulfide scrambling, as well as the occurrence of lanthionines8 or trisulfide bridges9, 10, is of concern in the production and shelf life of recombinant proteins11. Therefore, the identification of cysteine modifications is an important part of comprehensive protein characterization.

Mass spectrometry-based methods have been widely used in the analysis of cysteine modifications12-19. Conventional methods for disulfide mapping include enzymatic digestion of non-reduced proteins by pepsin at pH 2-413 or blocking free thiol groups by inclusion of N-ethylmaleimide (NEM)20 to prevent disulfide bond scrambling; however, pepsin generates non-specific cleavages, which increase the complexity of the digests. Alternatively, enzymes such as trypsin or Asp-N can provide specific cleavages; however, these enzymes require basic pH, which may lead to disulfide bond scrambling21, 22. Following proteolytic digestion, the masses of peptides before and after reduction are analyzed by LC-MS and compared in order to determine the free cysteine and disulfide containing peptides. To locate the linkage positions of disulfides, ESI-MS in the negative mode has often been employed23. Recently, electron capture dissociation (ECD)24 and electron transfer dissociation (ETD)17 have been reported to facilitate disulfide linkage identification.

In a recent paper from our laboratory, we applied a novel algorithm, termed Search for Modified Peptides (SeMoP), to discover unexpected modifications of peptides in LC-MS data25. Two modifications, corresponding to the loss of sulfur from a cysteine residue, were identified. The first desulfurization, corresponding to cysteine to dehydroalanine conversion, was associated with the known heat and alkaline induced β-elimination of disulfide bridges12, 26, 27. Recently, formation of dehydroalanine via β-elimination of disulfide was utilized to generate diagnostic fragment ions for disulfide containing peptides12 and for assessment of the stability of proteins27. In addition, conversion of cysteine to dehydroalanine in human serum albumin was linked to a specific form of hypoalbuminemia28.

The other desulfurization detected by the SeMoP algorithm25 corresponded to the conversion of cysteine to alanine. This modification could potentially be erroneously attributed to a mutation resulting from a single nucleotide polymorphism (SNP). However, such a mutation is highly unlikely because two nucleotides would have to be changed to convert the codon for cysteine to that for alanine29, 30. On the other hand, it was documented that cysteine desulfurase can convert cysteine to alanine as part of sulfur trafficking in bacteria31. Recently, chemically induced desulfurization of cysteine, by reduction catalyzed with palladium or Raney nickel, was utilized in peptide synthesis32.

In this paper, we examined the two above desulfurization reactions of cysteine induced in sample handling for protein characterization in order to provide insight into proper sample preparation protocols. Atriopeptin and lysozyme were selected as model compounds to generate disulfide linked peptides. Tryptic digests of the two compounds were subjected to heating at various temperatures with or without reducing reagents to study the formation of the modifications. In addition, we examined the reaction mechanism for the conversion of cysteine to alanine and proposed a thiyl radical-based mechanism for this conversion.


EXPERIMENTAL

Materials

Atriopeptin and somatostatin were purchased from Anaspec (San Jose, CA). Egg white lysozyme, Tris buffer, tris(2-carboxyethyl)phosphine hydrochloride (TCEP), ethylenediaminetetraacetic acid (EDTA), iodoacetamide (IAM), ammonium bicarbonate, α-cyano-4-hydroxycinnamic acid (CHCA) and heavy water (D2O), 99.9% pure, were obtained from Sigma-Aldrich (St Louis, MO). Trypsin Gold, mass spectrometry grade, was purchased from Promega (Madison, WI). HPLC grade acetonitrile (ACN), trifluoroacetic acid (TFA) and formic acid (FA) were from Thermo Fisher Scientific (Fair Lawn, NJ). Ellman’s reagent [DTNB, 5,5′-dithiobis-(2-nitrobenzoic acid)] was purchased from Thermo Fisher Scientific (Rockford, IL).

Preparation of Model Compounds

Stock solutions of atriopeptin, lysozyme and somatostatin (1 μg/μL) were prepared in deionized water and stored at -20 °C. A stock solution of reducing reagent, 50 mM DTT or TCEP, was freshly made by dissolving DTT or TCEP in 100 mM ammonium bicarbonate.

Atriopeptin (0.1 μg/μL) was subjected to trypsin digestion for one hour with a 50:1 (w/w) substrate to enzyme ratio at 37 °C in 100 mM ammonium bicarbonate buffer, pH 8.0. Lysozyme was digested with a 30:1 (w/w) substrate to enzyme ratio at 37 °C for 4 hours. Digestion was stopped by addition of 1% FA.

Cysteine Desulfurization

Forty μL of a tryptic digest of atriopeptin or lysozyme was heated at 60 °C or 90 °C for up to 2 hours, either without reduction or in the presence of reducing reagent (5 mM of TCEP or DTT).

The atriopeptin digest was also prepared in D2O using the following procedure. Atriopeptin digest (40 μL) was dried to remove H2O and then resuspended in 100 mM ammonium bicarbonate in D2O (pH 8.0). Peptides in D2O were dried to remove residual H2O and finally resuspended in 40 μL of 100 mM ammonium bicarbonate prepared in D2O. Four μL of TCEP in 100 mM ammonium bicarbonate in D2O was added to 40 μL of atriopeptin solution. The atriopeptin was heated in the presence of TCEP at 90 °C for 20 minutes, and then the peptides were extracted with a C18 ZipTip (Millipore, Billerica, MA) to remove TCEP and resuspended in 50% ACN : 0.1% TFA in H2O (50 : 50). Finally, the samples were dried and resuspended in 0.1% FA (in H2O) before MALDI MS analysis. The somatostatin peptide, without digestion, was directly heated at 90 °C for up to 2 hours alone or in the presence of 5 mM of TCEP.

Analysis of Free Sulfhydryl Groups

Ellman’s reagent was used to determine the free sulfhydryl groups in the atriopeptin digest before and after reduction, following the manufacturer’s protocol. The UV absorbance of the sample was measured at 412 nm in a Nanodrop ND-1000 spectrophotometer (Thermo Fisher Scientific, Wilmington, DE). The concentration of sulfhydryl groups in the sample was calculated using the molar extinction coefficient of 2-nitro-5-thiobenzoic acid (14,150 M⁻¹ cm⁻¹).
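The concentration calculation follows the Beer-Lambert law, c = A / (ε l). A minimal sketch is given below; the function name, the default 1 cm path length, and the dilution factor argument are assumptions for illustration, and the absorbance is assumed to be blank-corrected and expressed for the stated path length.

```python
# Molar extinction coefficient of 2-nitro-5-thiobenzoic acid at 412 nm,
# as quoted in the text (M^-1 cm^-1).
TNB_EXTINCTION_M_CM = 14150.0


def sulfhydryl_concentration_molar(absorbance_412, path_length_cm=1.0, dilution_factor=1.0):
    """Free sulfhydryl concentration (mol/L) from a blank-corrected A412.

    Applies c = A / (epsilon * l), then scales by the dilution factor of the
    assay mixture relative to the original sample.
    """
    if path_length_cm <= 0:
        raise ValueError("path length must be positive")
    return absorbance_412 / (TNB_EXTINCTION_M_CM * path_length_cm) * dilution_factor


# Hypothetical reading: A412 = 0.1415 at a 1 cm path length, undiluted sample.
concentration = sulfhydryl_concentration_molar(0.1415)
print(f"{concentration * 1e6:.1f} uM free sulfhydryl")
```

The fraction of free cysteine can then be estimated by dividing this concentration by the total cysteine concentration expected from the amount of peptide in the assay.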

Mass Spectrometry

Samples were desalted with ZipTip μ-C18 pipette tips (Millipore) and spotted on the MALDI plate, followed by addition of MALDI matrix consisting of 10 mg/mL of CHCA in 50% ACN: 0.1% TFA. Peptide MS and MS/MS spectra were obtained using a MALDI-TOF/TOF 4700 mass spectrometer (Applied Biosystems, Framingham, MA). The instrument was externally calibrated with Glu-fibrinopeptide B (m/z 1570.67).

For ESI-MS analysis, 2 pmol/μL of sample was infused into an LTQ-FT MS (Thermo Fisher Scientific, San Jose, CA). Data were acquired in the data-dependent mode with a survey scan using the FT MS, and then the ten most intense ions were fragmented for MS2 analysis on the LTQ (CID fragmentation using ± 2.5 m/z isolation width with 28% normalized collision energy).


RESULTS AND DISCUSSION

The goal of this paper was to study cysteine modifications induced during sample preparation for protein characterization, specifically the conversion of cysteine to dehydroalanine and of cysteine to alanine observed in our previous study25. To examine these modifications, atriopeptin, a small 23 amino acid peptide, was selected as a model compound, since it contains only a single disulfide bond and, after tryptic digestion without reduction, yields two peptides linked by a disulfide bond, thus minimizing the potential interference of non-cysteine peptides (see Figure 1A). Initially, the tryptic digest of atriopeptin was analyzed by MS to establish a baseline for comparison with the same sample processed under various conditions suitable for cysteine desulfurization.

MS Analysis of the Tryptic Digest of Atriopeptin

In order to generate a model compound with disulfide linked peptides, atriopeptin was first subjected to tryptic digestion for 1 hour at pH 8.0 without reduction. Figure 1B shows the ESI-MS spectrum of the tryptic digest. Peaks at m/z 674.3 (+3) and 1010.47 (+2) correspond to the fully digested product, while the peak at m/z 801.6 (+3) is the result of a miscleavage. Two homodimers from disulfide scrambling, at m/z 712.3 (+2) and 873.1 (+3), can also be seen in Figure 1B. The formation of homodimers by disulfide scrambling is likely a consequence of the digestion conditions (37 °C, pH 8.0)13, since higher amounts of the homodimers were found with longer digestion times (data not shown). The presence of an unpaired cysteine (sulfhydryl) has been reported to facilitate disulfide bond scrambling21. Ellman’s reagent was used to measure the free sulfhydryl groups in the atriopeptin tryptic digest. It was found that roughly 2% of the cysteine sulfhydryl groups were free in the non-reduced digest solution, and these sulfhydryl groups likely triggered the disulfide scrambling during heating under alkaline conditions, leading to the formation of the homodimers.

The same atriopeptin digest sample was also analyzed by MALDI-MS for comparison. Similar to the ESI-MS analysis, Figure 1C shows the expected tryptic digestion product of atriopeptin at m/z 2019.8 and a miscleaved digestion product at m/z 2404.0. In addition, the two homodimers at m/z 1423.5 (+1) and 2616.1 (+1) were observed. Besides the above ions, which were also found in the ESI-MS spectrum, two peaks with relatively high intensities at m/z 713.2 and 1309.6, corresponding to the peptides with a single cysteine (sulfhydryl group), were observed. These two ions are most likely the result of in-source fragmentation of the disulfide bond in the MALDI process33. The disulfide bond in the ions of m/z 2019.8, m/z 1423.5 and m/z 2616.1 was presumably cleaved during the MALDI process to generate the sulfhydryl products. Importantly, the free cysteine-containing peptides were not observed in the ESI-MS spectrum (Figure 1B), likely due to the lower energy of the ionization process in ESI compared to MALDI. Care should be taken not to conclude that reduced peptides are present on the basis of MALDI MS analysis of disulfide bond-containing peptides.
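The assignments above can be cross-checked with simple mass arithmetic: forming one disulfide bond between two thiol-containing peptides removes two hydrogen atoms. The following sketch, with hypothetical function and constant names, predicts the singly charged disulfide-linked ions from the observed m/z values of the two free-thiol peptides (713.2 and 1309.6).

```python
PROTON = 1.00728    # mass of a proton, Da
HYDROGEN = 1.00783  # monoisotopic mass of a hydrogen atom, Da


def neutral_mass(mh_plus):
    """Neutral monoisotopic mass from a singly protonated m/z value."""
    return mh_plus - PROTON


def disulfide_linked_mh(mh_a, mh_b):
    """[M+H]+ of two thiol peptides joined by one disulfide bond (loss of 2 H)."""
    linked_neutral = neutral_mass(mh_a) + neutral_mass(mh_b) - 2 * HYDROGEN
    return linked_neutral + PROTON


SHORT, LONG = 713.2, 1309.6  # observed [M+H]+ of the two free-thiol peptides

heterodimer = disulfide_linked_mh(SHORT, LONG)      # compare with m/z 2019.8
homodimer_short = disulfide_linked_mh(SHORT, SHORT)  # compare with m/z 1423.5
homodimer_long = disulfide_linked_mh(LONG, LONG)     # compare with m/z 2616.1

for label, value in [
    ("heterodimer", heterodimer),
    ("homodimer (short peptide)", homodimer_short),
    ("homodimer (long peptide)", homodimer_long),
]:
    print(f"{label}: {value:.2f}")
```

The predicted values agree with the reported peaks to within about 0.2 m/z units, which is consistent with the inputs being quoted to one decimal place.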

Conversion of Cysteine to Dehydroalanine

We next examined the conversion of cysteine to dehydroalanine in the atriopeptin tryptic digest analyzed above, a conversion that has been reported upon heating a peptide at 37 °C for several days27. In order to accelerate this conversion, we heated the atriopeptin digest to 60 °C or 90 °C in ammonium bicarbonate buffer (pH 8.0) to generate a sufficient amount of modified peptides that could be confidently identified by MALDI MS/MS analysis. As shown in Figure 2B, after heating the atriopeptin digest to 60 °C for 1 hour in 100 mM ammonium bicarbonate, the expected dehydroalanine product (IGAQSGLGdANSFR) at m/z 1275.6 was detected, corresponding to a 34 Da loss from the original cysteine-containing peptide (IGAQSGLGCNSFR) at m/z 1309.6. The MS/MS spectrum confirmed the conversion of IGAQSGLGCNSFR to IGAQSGLGdANSFR (Figure 2C). This result is in agreement with other reports that heat and alkaline conditions can induce β-elimination of disulfide bridges12, 26, 27 (Figure 1A). In this case, the dehydroalanine-containing peptide at m/z 1275.6 could be generated via β-elimination from two disulfide-linked tryptic peptides: (a) the original disulfide-linked peptide at m/z 2019.8 (Figure 1C) and (b) the homodimer resulting from disulfide scrambling at m/z 2616.1 (Figure 1C). During the heating process (pH 8.0), additional homodimers could be generated by disulfide scrambling; however, at the same time, the homodimers were undergoing decomposition to dehydroalanine, thiocysteine or other products, e.g. thiolate anion34. Thus, the overall change in the abundance of homodimers will be the result of multiple (generation and decomposition) processes.
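The 34 Da loss quoted above corresponds to the elemental difference between a cysteine and a dehydroalanine residue, i.e. H2S. As an illustrative sketch (assuming standard monoisotopic atomic masses and the usual residue formulas C3H5NOS for cysteine and C3H3NO for dehydroalanine; the helper names are hypothetical), the shift can be computed as:

```python
# Monoisotopic atomic masses (Da).
MONO = {"C": 12.00000, "H": 1.00783, "N": 14.00307, "O": 15.99491, "S": 31.97207}

CYS_RESIDUE = {"C": 3, "H": 5, "N": 1, "O": 1, "S": 1}  # cysteine residue
DHA_RESIDUE = {"C": 3, "H": 3, "N": 1, "O": 1}          # dehydroalanine residue

OBSERVED_CYS_PEPTIDE = 1309.6  # m/z of the cysteine-containing peptide (from the text)


def formula_mass(formula):
    """Monoisotopic mass of an elemental composition given as {element: count}."""
    return sum(MONO[element] * count for element, count in formula.items())


# Mass lost when a cysteine residue becomes dehydroalanine (formally H2S).
loss = formula_mass(CYS_RESIDUE) - formula_mass(DHA_RESIDUE)
predicted_dha_peptide = OBSERVED_CYS_PEPTIDE - loss

if __name__ == "__main__":
    print(f"loss: {loss:.3f} Da")
    print(f"predicted dehydroalanine peptide: {predicted_dha_peptide:.1f}")
```

The predicted value agrees with the peak reported at m/z 1275.6.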

Interestingly, a small peak at m/z 1275.6, corresponding to the loss of 34 Da, was present in the control sample that was not subjected to heating, see Figure 2A. Presumably, this peak, corresponding to a dehydroalanine-containing peptide, was generated during the tryptic digestion (37 °C for 1 hour). In order to confirm this hypothesis, MS spectra of the digestion mixture were collected every 10 minutes for 1 hour. The peak at m/z 1275.6 was not observed by MALDI-MS at the beginning of digestion, but was detected after 10 minutes, and its intensity increased with time. This result supports the hypothesis that the dehydroalanine product was generated during the digestion process. It is expected that prolonged digestion, for example overnight digestion, of non-reduced protein may lead to significant accumulation of dehydroalanine products resulting from β-elimination. Therefore, besides reported sample preparation procedures, such as protein heating for denaturation21 and long-term storage8, the widely used enzymatic digestion at 37 °C was also found to lead to the formation of dehydroalanine.

Typically, the cysteine to dehydroalanine conversion was performed in 100 mM ammonium bicarbonate buffer at pH 8.0, since ammonium bicarbonate is commonly used as an MS-friendly buffer in sample preparation. However, ammonium bicarbonate is thermally unstable, and extensive heating could change the pH of the solution, thus potentially affecting the rate of β-elimination26. It was found that heating a 100 mM solution of ammonium bicarbonate to 60 °C for 1 hour increased the pH from 8.0 to 8.5. In order to assess the effect of this pH change on the rate of β-elimination, the desulfurization reaction was performed in thermally stable Tris buffer with the pH adjusted to 7, 8 and 9. Based on the normalized peak height of the dehydroalanine product in the MALDI-MS spectra, there was no substantial change in the extent of cysteine to dehydroalanine conversion. These results suggest that the thermal decomposition of ammonium bicarbonate should not substantially affect the conversion of cysteine to dehydroalanine.

Cysteine Desulfurization in the Presence of Reducing Reagents

Since the β-elimination reaction requires disulfide bridges, reduction of disulfides should prevent cysteine desulfurization. However, when the reduction is performed at elevated temperature, such as in the case of reduction by DTT, β-elimination could compete with the reduction, potentially leading to accumulation of the dehydroalanine product. In order to determine the amount of desulfurized product formed during reduction at elevated temperature, the atriopeptin digest was heated at 60 °C for 1 hour in the presence of 5 mM DTT and analyzed by MALDI MS. As can be seen from Figure 3A, no peak corresponding to dehydroalanine was detected. It was concluded that, for the atriopeptin digest, commonly applied experimental conditions for protein reduction, i.e. DTT at 60 °C for 30 minutes, lead to minimal formation of dehydroalanine.

The tryptic digest of atriopeptin was next heated to 60 °C for 1 hour in the presence of TCEP. Surprisingly, heating the peptide with TCEP led to the formation of a new modification corresponding to the loss of 32 Da (Figure 3B), suggesting conversion of cysteine to alanine. To confirm this modification, the peak at m/z 1277.6, i.e. loss of 32 Da from the peptide IGAQSGLGCNSFR (m/z 1309.6), was subjected to MS/MS analysis, shown in Figure 3C. It can be seen that, starting from the y5 fragment ion, which corresponds to the position of the cysteine residue, the masses of the fragments are shifted by 2 Da with respect to the dehydroalanine product formed without TCEP (Figure 2C), supporting the conversion of cysteine to alanine.
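The y-ion argument can be illustrated with a short calculation. This is a sketch under stated assumptions, not the analysis software used here: it assumes standard monoisotopic residue masses, singly charged y ions, and the peptide sequence as labeled in Figure 1 (IGAQSGLGCNSFR), with the residue at the cysteine position replaced by either dehydroalanine ("dA") or alanine. All names are illustrative.

```python
# Monoisotopic residue masses (Da); "dA" denotes dehydroalanine (C3H3NO).
RESIDUE = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "dA": 69.02146,
    "L": 113.08406, "I": 113.08406, "N": 114.04293, "Q": 128.05858,
    "F": 147.06841, "R": 156.10111,
}
WATER = 18.01056
PROTON = 1.00728


def variant(site_residue):
    """Residue list for IGAQSGLG-X-NSFR, with X at the original cysteine position."""
    return list("IGAQSGLG") + [site_residue] + list("NSFR")


def y_ions(residues):
    """Singly charged y-ion m/z values, ordered y1, y2, ..."""
    ions = []
    running = WATER + PROTON
    for res in reversed(residues):  # y ions grow from the C-terminus
        running += RESIDUE[res]
        ions.append(running)
    return ions


dha = y_ions(variant("dA"))
ala = y_ions(variant("A"))
# Difference (alanine minus dehydroalanine product) for each y ion.
shifts = [round(a - d, 3) for a, d in zip(ala, dha)]

if __name__ == "__main__":
    for n, shift in enumerate(shifts, start=1):
        print(f"y{n}: shift = {shift:.3f} Da")
```

In this sketch y1 to y4 (R, FR, SFR, NSFR) are identical for both products, and every ion from y5 onward differs by about 2 Da, which is the pattern described for Figures 2C and 3C.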

To further explore the conversion of cysteine to alanine, a similar experiment was performed using a tryptic digest of lysozyme. MALDI-MS spectra were collected at different times to assess the kinetics of cysteine desulfurization, see Figure 4. Initially, an ion corresponding to two lysozyme tryptic peptides connected by a disulfide bridge (m/z 2542.5) can be clearly observed in Figure 4A. After 10 minutes of heating at 90 °C (Figure 4B), a peptide ion at m/z 1276.9, corresponding to a reduced form of the 2542.5 ion, appears along with an ion of relatively low intensity (m/z 1244.9), representing an alanine-containing peptide with a loss of 32 Da from the original cysteine-containing peptide. The peak corresponding to the disulfide bond-linked peptide ion (m/z 2542.5) was not detected, indicating that this peptide was completely reduced. Importantly, the intensity of the ion corresponding to the desulfurized peptide (m/z 1244.9) increased with additional heating, as shown in Figure 4C. Comparing Figure 4B with Figure 4C, an obvious change in the ratio of intensities of these two ions, with the generation of the alanine-containing peptide, is observed. MS/MS analysis confirmed the conversion of cysteine to alanine in this peptide. Moreover, the peak corresponding to the alanine-containing peptide increased even in the absence of a disulfide bond (Figures 4B and 4C), suggesting that the disulfide bond is not required for this conversion. Based on this result, the mechanism of conversion of cysteine to alanine appears to be different from that of β-elimination.

Mechanism of Cysteine to Alanine Conversion

We found that the unexpected conversion of cysteine to alanine can be induced by heating in the presence of TCEP, and we thus investigated the mechanism of this conversion. A similar desulfurization of cysteine to alanine has been reported in glycan-peptide synthesis35. Based on that work, we hypothesized that the cysteine to alanine modification resulted from a direct conversion. In order to examine the proposed mechanism, the reaction was performed in D2O and compared to the reaction in H2O. As shown in Figure 5 (inset), the direct conversion in D2O should lead to incorporation of a single deuterium atom. Thus, the newly formed alanine peptide should show a +1 Da mass shift compared to the alanine product obtained in H2O. On the other hand, a mechanism that involves formation of dehydroalanine should lead to incorporation of two deuterium atoms, thus increasing the mass by 2 Da, which can be distinguished by MS. The mass spectra for the reactions performed in D2O and in H2O are shown in Figures 5A and 5B, respectively. It can be seen that the mass of the original cysteine-containing peptide (m/z 1309.6) remained the same, suggesting that the deuterium atoms on amino acids that are not involved in the cysteine to alanine conversion are back-exchanged after the peptide is resuspended in H2O. However, a 1 Da shift was observed for the alanine-containing peptide generated in D2O (m/z 1278.6 in Figure 5A vs. m/z 1277.6 in Figure 5B), indicating that the modification indeed proceeds via the direct conversion of cysteine to alanine.
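The logic of the labeling experiment can be written out as a small comparison. The sketch below is illustrative only: it assumes singly charged ions, that each non-exchangeable deuterium adds the D minus H mass difference, and that the two candidate pathways incorporate one and two deuterium atoms, respectively, as stated above. The dictionary and function names are hypothetical.

```python
H_MASS = 1.00783
D_MASS = 2.01410
SHIFT_PER_D = D_MASS - H_MASS  # about 1.006 Da per incorporated deuterium

ALA_PEPTIDE_IN_H2O = 1277.6    # observed m/z, Figure 5B
ALA_PEPTIDE_IN_D2O = 1278.6    # observed m/z, Figure 5A

# Number of non-exchangeable deuterium atoms expected for each pathway.
EXPECTED_DEUTERIUMS = {
    "direct conversion": 1,
    "via dehydroalanine": 2,
}


def expected_mz(n_deuterium, reference=ALA_PEPTIDE_IN_H2O):
    """Expected m/z of the alanine peptide formed in D2O (singly charged)."""
    return reference + n_deuterium * SHIFT_PER_D


def best_match(observed):
    """Pathway whose predicted m/z lies closest to the observed value."""
    return min(
        EXPECTED_DEUTERIUMS,
        key=lambda name: abs(expected_mz(EXPECTED_DEUTERIUMS[name]) - observed),
    )


if __name__ == "__main__":
    for name, n in EXPECTED_DEUTERIUMS.items():
        print(f"{name}: expected m/z {expected_mz(n):.1f}")
    print("closest to observation:", best_match(ALA_PEPTIDE_IN_D2O))
```

With the reported values, the observed peak at m/z 1278.6 lies closest to the single-deuterium prediction.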

To further confirm that one deuterium atom was incorporated into the alanine residue, MS/MS spectra of the alanine products generated in both D2O and H2O were obtained, see Figures 5C and 5D, respectively. Comparing Figures 5C and 5D, there is a 1 Da shift for y5, whereas the masses of y2, y3 and y4 remain unchanged. This result supports the incorporation of a single deuterium atom into the alanine residue. The deuterium could be incorporated at either the Cα or the Cβ position, as shown in the inset of Figure 5. The exact position on the alanine residue is not known at this time; NMR could potentially identify the position of the deuterium in the final alanine product.

Analogous to reference35, we propose that the conversion follows the thiyl radical mechanism, as shown in Figure 6. Generally, the cysteine thiyl radical can be generated by interaction with mild oxidants such as redox-active transition metals, UV light and oxygen. Next, the cysteine thiyl radical can react with the alkylphosphine (TCEP) to generate a phosphoranyl species. Decomposition of this product can yield an alkyl radical, which homolytically abstracts a hydrogen atom from a remaining intact mercaptan, i.e. a cysteine residue. This reaction leads to the formation of the desulfurized product, alanine, and a new cysteine thiyl radical. Importantly, as we have shown, this mechanism does not require a disulfide bridge, and thus any free cysteine residue should be accessible to this reaction. In summary, these results show that heating a cysteine-containing peptide in the presence of TCEP leads to the conversion of cysteine to alanine via a radical mechanism. Therefore, while heating disulfide-containing peptides in the presence of DTT can efficiently reduce the disulfide bonds without desulfurization of cysteine, heating the same peptide samples in the presence of TCEP should be avoided.

Peptide Fragmentation Caused by Heating with TCEP

Since it is well known that heating proteins/peptides can lead to peptide bond cleavage, causing protein/peptide fragmentation8, 11, we further explored whether heating a peptide in the presence of TCEP can affect the MS fragmentation process. Another peptide, somatostatin, which contains an intramolecular disulfide bond, was heated in the presence of TCEP. Figure 7B shows backbone cleavages along the peptide chain. MS/MS analysis was performed for the individual peaks to confirm the identity of the modified peptides. Besides the cysteine to alanine conversion (peaks 5 & 6 in Figure 7B), additional side reactions resulting from the heating of cysteine-containing peptides were observed: hydrolysis products, such as formation of an amide at the C-terminus and a pyruvoyl residue at the N-terminus of a newly cleaved bond, identical to those from hydrolysis of dehydroalanine8 (peaks 1 - 4 in Figure 7B). Surprisingly, although, as discussed above, the TCEP-induced conversion of cysteine to alanine does not involve the generation of dehydroalanine, the same hydrolysis products were detected as when the disulfide-containing peptide was heated alone. In addition, some other unidentified fragments were found. These fragments may have appeared because the cysteine thiyl radical can abstract a Cα hydrogen from the peptide backbone, eventually leading to peptide damage that is not restricted to the amino acid residue of the initial attack36, 37.

Interestingly, we evaluated the rates of formation of these hydrolytic products and concluded that the hydrolytic cleavage is significantly enhanced in the presence of TCEP. After 20 minutes of heating with TCEP, the hydrolytic cleavage products (peaks 1 - 4 in Figure 7B) can be observed. No hydrolytic cleavage products were detected in the sample without TCEP after 20 minutes of heating (Figure 7A); they were only observed after extending the heating time (40 minutes). The mechanism leading to enhanced hydrolytic cleavage is at present unknown. It could be examined by using UV radiation to generate the cysteine thiyl radical in the absence of heating, in order to distinguish it from other fragmentation mechanisms caused by heating.

CONCLUSIONS

In this paper, we selected a single disulfide-containing peptide, atriopeptin, as a model compound to study cysteine modifications induced during sample preparation for protein characterization. We examined two unusual cysteine modifications caused by sample preparation, i.e. conversion of cysteine to dehydroalanine and of cysteine to alanine. Our results indicate that, since disulfide bond-containing peptides are prone to β-elimination and thus to the formation of dehydroalanine, special care should be taken to avoid exposure of disulfide bond-containing samples to heat. Though the extent of β-elimination at room temperature is minimal, overnight protein digestion at 37 °C without the use of a reducing reagent may lead to formation of peptides with a dehydroalanine residue. In addition, it is expected that heat-induced protein denaturation without reduction (90 °C for 10 minutes) may lead to relatively high levels of desulfurization and thus should be avoided. Next, it was found that addition of reducing reagents, specifically DTT, is an effective way to prevent cysteine desulfurization, since the reduction of disulfide bonds prevents β-elimination under heating conditions. On the other hand, it was found that, though TCEP is an effective reducing reagent at room temperature, an artifact corresponding to the conversion of cysteine to alanine can occur after heating cysteine-containing peptides in the presence of TCEP. Therefore, special care should be taken to avoid heating TCEP-containing samples. In addition, it was found that hydrolytic cleavage of peptide bonds is significantly enhanced by heating in the presence of TCEP; the mechanism is currently unknown. Finally, we investigated the mechanism of the cysteine to alanine conversion and found that the reaction follows a radical-based desulfurization mechanism.

[Figure 1, panels A-C. Panel A: reaction scheme for atriopeptin (trypsin digestion to the disulfide bond linked peptide, followed by pathways i and ii); labels in the scheme include intact peptide, fragments, disulfide bridge, dehydroalanine (MH+ = 1275.6), thiocysteine residue, and the values MH+ = 2386.1, 2019.8, 713.2 and 1309.6. Panel B: ESI mass spectrum, Relative Abundance vs. m/z (550-1150); labeled species: original disulfide linked peptide (IGAQSGLGCNSFR / SSCFGGR), homodimers resulting from disulfide bond scrambling, IGAQSGLGdANSFR, and miscleaved digest; labeled peaks: 638.33 (z=2), 655.07 (z=4), 673.98 (z=3), 712.30 (z=2), 801.56 (z=4), 873.09 (z=3), 1010.97 (z=2). Panel C: MALDI mass spectrum, x-axis Mass (m/z); labeled species: free cysteines (reduced form), original disulfide linked peptide, homodimers, and miscleaved digest.]

Figure 1. (A) Schematic of atriopeptin digested by trypsin without reduction to generate a disulfide-linked peptide, and further subjected to (i) heat-induced β-elimination and (ii) reduction; (B) ESI mass spectrum of atriopeptin digest (the peak at m/z 655.1 (+4) represents the same homodimer as the peak at m/z 873.1 (+3)); and (C) MALDI mass spectrum of atriopeptin digest. (dA represents dehydroalanine)

[Figure 2, panels A-C; x-axis: Mass (m/z)]

Figure 2. MALDI mass spectra of (A) atriopeptin digested at 37 °C for 1 hour; (B) atriopeptin digest heated at 60 °C for 1 hour; (C) MALDI MS/MS spectrum of the peak at m/z 1275.6 corresponding to the dehydroalanine-containing peptide. (dA represents dehydroalanine)

[Figure 3, panels A-C; x-axis: Mass (m/z)]

Figure 3. MALDI mass spectra resulting from heating of a disulfide-containing peptide at 60 °C for 1 hour in the presence of reducing reagent (A) 5 mM DTT; (B) 5 mM TCEP; and (C) MS/MS spectrum of the product of desulfurization at m/z 1277.6, corresponding to the conversion of cysteine to alanine. Other conditions as in Figure 2.

[Figure 4, panels A-C]

Figure 4. MALDI mass spectra of (A) lysozyme digested by trypsin at 37 °C for 4 hours and stored at room temperature; lysozyme digest heated in the presence of TCEP at 90 °C (B) for 10 minutes; and (C) for 120 minutes. Inset shows a cysteine-containing peptide (peak 1) converted to the alanine product (peak 2) after heating in the presence of TCEP.

[Figure 5, panels A-D; x-axis: Mass (m/z)]

Figure 5. MALDI MS of atriopeptin digest (A) heated at 90 °C with TCEP in ammonium bicarbonate in D2O for 20 minutes; (B) heated at 90 °C with TCEP in ammonium bicarbonate in H2O for 20 minutes; (C) MS/MS of the precursor ion at m/z 1278.6 from (A); and (D) MS/MS of the precursor ion at m/z 1277.6 from (B).

Figure 6. Proposed mechanisms for radical desulfurization


[Figure 7, panels A and B; x-axis: Mass (m/z); annotations: somatostatin structure drawing (AG, KNFFWKTFTS, and the two cysteines), a 32 Da mass difference marker, and peaks numbered 1-7 in panel B]

Figure 7. MALDI mass spectra of (A) intact somatostatin heated at 90 °C for 30 minutes as control (intact peptide at m/z 1637.7); (B) somatostatin heated at 90 °C for 30 minutes in the presence of TCEP, with identified peaks 1-7 as follows: 1. CH3COCOKNFFWKTFTSA, m/z 1447.7; 2. CH3COCOKNFFWKTFTSC, m/z 1479.7; 3. AGAKNFFWKTFTS-NH2, m/z 1503.7; 4. AGCKNFFWKTFTS-NH2, m/z 1535.7; 5. AGAKNFFWKTFTSA, m/z 1575.7; 6. AGCKNFFWKTFTSA and AGAKNFFWKTFTSC, m/z 1607.7; and 7. AGCKNFFWKTFTSC (reduced somatostatin), m/z 1639.7.

References

1. Leichert, L. I.; Jakob, U., Protein thiol modifications visualized in vivo. PLoS Biology 2004, 2 (11), 1723-1737.

2. Stamler, J. S., Redox signaling: nitrosylation and related target interactions of nitric oxide. Cell 1994, 78 (6), 931-936.

3. Klatt, P.; Lamas, S., Regulation of protein function by S-glutathiolation in response to oxidative and nitrosative stress. European Journal of Biochemistry 2000, 267 (16), 4928-4944.

4. Arolas, J. L.; Aviles, F. X.; Chang, J. Y.; Ventura, S., Folding of small disulfide-rich proteins: clarifying the puzzle. Trends in Biochemical Sciences 2006, 31 (5), 292-301.

5. Bardwell, J. C. A., The dance of disulfide formation. Nature Structural & Molecular Biology 2004, 11 (7), 582-583.

6. Daggett, V.; Fersht, A. R., Is there a unifying mechanism for protein folding? Trends in Biochemical Sciences 2003, 28 (1), 18-25.

7. Giles, N. M.; Watts, A. B.; Giles, G. I.; Fry, F. H.; Littlechild, J. A.; Jacob, C., Metal and redox modulation of cysteine protein function. Chemistry & Biology 2003, 10 (8), 677-693.

8. Cohen, S. L.; Price, C.; Vlasak, J., beta-elimination and peptide bond hydrolysis: Two distinct mechanisms of human IgG1 hinge fragmentation upon storage. Journal of the American Chemical Society 2007, 129 (22), 6976-+.

9. Pristatsky, P.; Cohen, S.; Krantz, D.; Acevedo, J.; Ionescu, R.; Vlasak, J., Evidence for Trisulfide Bonds in a Recombinant Variant of a Human IgG2 Monoclonal Antibody. Analytical Chemistry 2009, July 10, 2009, DOI: 10.1021/ac9006254.

10. Canova-Davis, E.; Baldonado, I. P.; Chloupek, R. C.; Ling, V. T.; Gehant, R.; Olson, K.; Gillece-Castro, B. L., Confirmation by mass spectrometry of a trisulfide variant in methionyl human growth hormone biosynthesized in Escherichia coli. Analytical Chemistry 1996, 68 (22), 4044-4051.

11. Wang, W.; Singh, S.; Zeng, D. L.; King, K.; Nema, S., Antibody structure, instability, and formulation. Journal of Pharmaceutical Sciences 2007, 96 (1), 1-26.

12. Kim, J. S.; Kim, H. J., Matrix-assisted laser desorption/ionization time-of-flight mass spectrometric observation of a peptide triplet induced by thermal cleavage of cystine. Rapid Commun Mass Spectrom 2001, 15 (23), 2296-300.

13. Gorman, J. J.; Wallis, T. P.; Pitt, J. J., Protein disulfide bond determination by mass spectrometry. Mass Spectrometry Reviews 2002, 21 (3), 183-216.

14. Loo, J. A., Effect of reducing disulfide-containing proteins on electrospray ionization mass spectra. Anal. Chem. 1990, 62, 693-698.

15. Yu, H. Q.; Murata, K.; Hedrick, J. L.; Almaraz, R. T.; Xiang, F.; Franz, A. H., The disulfide bond pattern of salmon egg lectin 24 K from the Chinook salmon Oncorhynchus tshawytscha (vol 463, pg 1, 2007). Archives of Biochemistry and Biophysics 2008, 479 (2), 179-179.

16. Yan, B. X.; Valliere-Douglass, J.; Brady, L.; Steen, S.; Han, M.; Pace, D.; Elliott, S.; Yates, Z.; Han, Y. H.; Balland, A.; Wang, W. C.; Pettit, D., Analysis of post-translational modifications in recombinant monoclonal antibody IgG1 by reversed-phase liquid chromatography/mass spectrometry. Journal of Chromatography A 2007, 1164 (1-2), 153-161.

17. Wu, S. L.; Jiang, H. T.; Lu, Q. Z.; Dai, S. J.; Hancock, W. S.; Karger, B. L., Mass Spectrometric Determination of Disulfide Linkages in Recombinant Therapeutic Proteins Using Online LC-MS with Electron-Transfer Dissociation. Analytical Chemistry 2009, 81 (1), 112-122.

18. Jones, M. D.; Patterson, S. D.; Lu, H. S., Determination of disulfide bonds in highly bridged disulfide-linked peptides by matrix-assisted laser desorption/ionization mass spectrometry with postsource decay. Analytical Chemistry 1998, 70 (1), 136-143.

19. Kim, H. I.; Beauchamp, J. L., Mapping Disulfide Bonds in Insulin with the Route 66 Method: Selective Cleavage of S-C Bonds Using Alkali and Alkaline Earth Metal Enolate Complexes. Journal of the American Society for Mass Spectrometry 2009, 20 (1), 157-166.

20. Taylor, F. R.; Prentice, H. L.; Garber, E. A.; Fajardo, H. A.; Vasilyeva, E.; Pepinsky, R. B., Suppression of sodium dodecyl sulfate-polyacrylamide gel electrophoresis sample preparation artifacts for analysis of IgG4 half-antibody. Analytical Biochemistry 2006, 353 (2), 204-208.

21. Liu, H. C.; Gaza-Bulseco, G.; Chumsae, C.; Newby-Kew, A., Characterization of lower molecular weight artifact bands of recombinant monoclonal IgG1 antibodies on non-reducing SDS-PAGE. Biotechnology Letters 2007, 29 (11), 1611-1622.

22. Pompach, P.; Man, P.; Kavan, D.; Hofbauerová, K.; Kumar, V.; Bezouška, K.; Havlíček, V.; Novák, P., Modified electrophoretic and digestion conditions allow a simplified mass spectrometric evaluation of disulfide bonds. Journal of the American Society for Mass Spectrometry 2009.

23. Chelius, D.; Wimer, M. E. H.; Bondarenko, P. V., Reversed-phase liquid chromatography in-line with negative ionization electrospray mass spectrometry for the characterization of the disulfide-linkages of an immunoglobulin gamma antibody. Journal of the American Society for Mass Spectrometry 2006, 17 (11), 1590-1598.

24. Zubarev, R. A.; Kruger, N. A.; Fridriksson, E. K.; Lewis, M. A.; Horn, D. M.; Carpenter, B. K.; McLafferty, F. W., Electron capture dissociation of gaseous multiply-charged proteins is favored at disulfide bonds and other sites of high hydrogen atom affinity. Journal of the American Chemical Society 1999, 121 (12), 2857-2862.

25. Baumgartner, C.; Rejtar, T.; Kullolli, M.; Akella, L. M.; Karger, B. L., SeMoP: a new computational strategy for the unrestricted search for modified peptides using LC-MS/MS data. J Proteome Res 2008, 7 (9), 4199-208.

26. Friedman, M., Chemistry, biochemistry, nutrition, and microbiology of lysinoalanine, lanthionine, and histidinoalanine in food and other proteins. Journal of Agricultural and Food Chemistry 1999, 47 (4), 1295-1319.

27. Nakanishi, T.; Sato, T.; Sakoda, S.; Yoshioka, M.; Shimizu, A., Modification of cysteine residue in transthyretin and a synthetic peptide: analyses by electrospray ionization mass spectrometry. Biochimica Et Biophysica Acta-Proteins and Proteomics 2004, 1698 (1), 45-53.

28. Bar-Or, R.; Rael, L. T.; Bar-Or, D., Dehydroalanine derived from cysteine is a common post-translational modification in human serum albumin. Rapid Commun. Mass Spectrom. 2008, 22 (5), 711-716.

29. Gloss, L. M.; Spencer, D. E.; Kirsch, J. F., Cysteine-191 in aspartate aminotransferases appears to be conserved due to the lack of a neutral mutation pathway to the functional equivalent, alanine-191. Proteins-Structure Function and Genetics 1996, 24 (2), 195-208.

30. Ohno, S., Ancient linkage groups and frozen accidents. Nature 1973, 244, 259.

31. Mihara, H.; Esaki, N., Bacterial cysteine desulfurases: their function and mechanisms. Applied Microbiology and Biotechnology 2002, 60 (1-2), 12-23.

32. Yan, L. Z.; Dawson, P. E., Synthesis of peptides and proteins without cysteine residues by native chemical ligation combined with desulfurization. Journal of the American Chemical Society 2001, 123 (4), 526-533.

33. Patterson, S. D.; Katta, V., Prompt fragmentation of disulfide-linked peptides during matrix-assisted laser desorption ionization mass spectrometry. Analytical Chemistry 1994, 66 (21), 3727-32.

34. Parker, A. J.; Kharasch, N., The scission of the sulfur-sulfur bond. Chemical Reviews 1969, 59.

35. Wan, Q.; Danishefsky, S. J., Free-radical-based, specific desulfurization of cysteine: A powerful advance in the synthesis of polypeptides and glycopolypeptides. Angewandte Chemie-International Edition 2007, 46 (48), 9248-9252.

36. Schoneich, C., Mechanisms of protein damage induced by cysteine thiyl radical formation. Chemical Research in Toxicology 2008, 21 (6), 1175-1179.

37. Mozziconacci, O.; Williams, T. D.; Kerwin, B. A.; Schoneich, C., Reversible Intramolecular Hydrogen Transfer between Protein Cysteine Thiyl Radicals and C-alpha-H Bonds in Insulin: Control of Selectivity by Secondary Structure. Journal of Physical Chemistry B 2008, 112 (49), 15921-15932.

Table S1. Chemical similarity matrix (CMS) developed

    A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  B  Z  X  -
A   4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0 -2 -1  0 -4
R  -1  6  0 -2 -3  1  0 -2  0 -3 -2  5 -1 -3 -2 -1 -1 -3 -2 -3 -1  0 -1 -4
N  -2  0  6  1 -3  5  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3  3  0 -1 -4
D  -2 -2  1  6 -3  0  6 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3  4  1 -1 -4
C   0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -3 -2 -4
Q  -1  1  5  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2  0  3 -1 -4
E  -1  0  0  6 -4  2  6 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4
G   0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3 -1 -2 -1 -4
H  -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3  0  0 -1 -4
I  -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3 -3 -3 -1 -4
L  -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1 -4 -3 -1 -4
K  -1  5  0 -1 -3  1  1 -2 -1 -3 -2  6 -1 -3 -1  0 -1 -3 -2 -2  0  1 -1 -4
M  -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1 -3 -1 -1 -4
F  -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  6  6 -1 -3 -3 -1 -4
P  -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2 -2 -1 -2 -4
S   1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2  0  0  0 -4
T   0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0 -1 -1  0 -4
W  -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  6 -4 -3 -2  6  6 -3 -4 -3 -2 -4
Y  -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  6 -3 -2 -2  6  6 -1 -3 -2 -1 -4
V   0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4 -3 -2 -1 -4
B  -2 -1  3  4 -3  0  1 -1  0 -3 -4  0 -3 -3 -2  0 -1 -4 -3 -3  4  1 -1 -4
Z  -1  0  0  1 -3  3  4 -2  0 -3 -3  1 -1 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4
X   0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1 -1 -1 -1 -4
-  -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4  1

Table S2. The 3D structural alignment of the TbAUK1 model structure vs. human Aurora A (PDB ID: 2j50). Each row represents a protein structure and each column represents a spatial position in the structural alignment. The boldface letters indicate the residues predicted to be functionally important using the top 8% of the POOL ranking. The aligned residues that differ between TbAUK1 and human Aurora A are marked with an asterisk. 'y' indicates the important ligand contact residues reported in the literature39. Columns containing important residues with key differences between TbAUK1 and human Aurora A are marked with an asterisk at the top of the column.

Human Aurora A  W128 A129 L130 E131 D132 F133 E134 I135 G136 R137 P138 L139
TbAUK1          -    -    -    -    D28  F29  E30  L31  L32  H33  K34  L35
ligand contact residues

*
Human Aurora A  G140 K141 G142 K143 F144 G145 N146 V147 Y148 L149 A150 R151
TbAUK1          G36  G37  G38  N39  Y40  G41  D42  V43  H44  L45  A46  S47
ligand contact residues  yy y y

Human Aurora A  E152 K153 Q154 S155 K156 F157 I158 L159 A160 L161 K162 V163
TbAUK1          V48  K49  D50  C51  N52  F53  V54  C55  A56  L57  K58  R59
ligand contact residues  y

Human Aurora A  L164 F165 V163 L164 F165 K166 A167 Q168 L169 E170 K171 A172
TbAUK1          L60  S61  R59  L60  S61  I62  K63  K64  L65  A66  D67  F68
ligand contact residues

Human Aurora A  G173 V174 E175 H176 Q177 L178 R179 R180 E181 V182 E183 I184
TbAUK1          D69  I70  A71  T72  Q73  L74  R75  R76  E77  I78  E79  I80
ligand contact residues

Human Aurora A  Q185 S186 H187 L188 R189 H190 P191 N192 I193 L194 R195 L196
TbAUK1          A81  F82  N83  T84  R85  H86  K87  Y88  L89  L90  R91  T92
ligand contact residues  y

Human Aurora A  Y197 G198 Y199 F200 H201 D202 A203 T204 R205 V206 Y207 L208
TbAUK1          Y93  A94  Y95  F96  F97  D98  E99  T100 D101 I102 Y103 L104
ligand contact residues

* *** *
Human Aurora A  I209 L210 E211 Y212 A213 P214 L215 G216 T217 V218 Y219 R220
TbAUK1          I105 M106 E107 P108 C109 S110 N111 G112 M113 L114 Y115 T116
ligand contact residues  yyyy y

Human Aurora A  E221 L222 Q223 K224 L225 S226 K227 F228 D229 E230 Q231 R232
TbAUK1          E117 L118 N119 R120 V121 K122 C123 F124 A125 P126 P127 T128
ligand contact residues

Human Aurora A  T233 A234 T235 Y236 I237 T238 E239 L240 A241 N242 A243 L244
TbAUK1          A129 A130 R131 Y132 V133 A134 Q135 L136 A137 E138 A139 L140
ligand contact residues

**
Human Aurora A  S245 Y246 C247 H248 S249 K250 R251 V252 I253 H254 R255 D256
TbAUK1          L141 Y142 L143 H144 Q145 H146 H147 I148 L149 H150 R151 D152
ligand contact residues

Human Aurora A  I257 K258 P259 E260 N261 L262 L263 L264 G265 S266 A267 G268
TbAUK1          I153 K154 P155 E156 N157 I158 L159 L160 D161 H162 N163 N164
ligand contact residues  yy yy

Human Aurora A  E269 L270 K271 I272 A273 D274 F275 G276 W277 S278 V279 H280
TbAUK1          N165 I166 K167 L168 A169 D170 F171 G172 W173 S174 V175 H176
ligand contact residues  yyy

Human Aurora A  A281 P282 S283 S284 R285 R286 T287 T288 L289 C290 G291 T292
TbAUK1          D177 P178 D179 N180 R181 R182 K183 T184 S185 C186 G187 T188
ligand contact residues

Human Aurora A  L293 D294 Y295 L296 P297 P298 E299 M300 I301 E302 G303 R304
TbAUK1          P189 E190 Y191 F192 P193 P194 E195 I196 V197 G198 R199 Q200
ligand contact residues

Human Aurora A  M305 H306 D307 E308 K309 V310 D311 L312 W313 S314 L315 G316
TbAUK1          A201 Y202 D203 T204 S205 A206 D207 L208 W209 C210 L211 G212
ligand contact residues

Human Aurora A  V317 L318 C319 Y320 E321 F322 L323 V324 G325 K326 P327 P328
TbAUK1          I213 F214 C215 Y216 E217 L218 L219 -    -    -    -    -
ligand contact residues

gate keeper; hinge region; catalytic loop; activation loop

Table S3. 3D structural alignment of the TbrPDEB1 model, the TbrPDEB2 model, human PDEB4 (PDB ID: 1xm4) and L. major PDEB1 (PDB ID: 2r8q) using PDBeFold1. Colored residues are located in the binding pockets of the protein: metal binding pocket (pink); Q switch / P pocket (blue); and S pocket (green). Columns containing important residues with key differences between TbrPDEB1 and human PDE4 are marked with an asterisk at the top of the column.

TbrPDEB1     K576 N577 V578 P579 S580 R581 A582 V583 K584 R585 V586 T587 A588
TbrPDEB2     ‐    ‐    ‐    ‐    ‐    ‐    ‐    r585 V586 T587 A588 I589 T590
HumanPDE4    ‐    ‐    ‐    ‐    ‐    ‐    ‐    ‐    ‐    ‐    ‐    ‐    ‐
L.majorPDEB1 ‐    ‐    ‐    ‐    ‐    ‐    ‐    ‐    ‐    V597 I598 A599 V600

TbrPDEB1     I589 T590 K591 V592 E593 R594 E595 A596 V597 L598 V599 C600 E601
TbrPDEB2     N591 R592 E593 R594 E595 A596 V597 L598 R599 I600 E601 F602 P603
HumanPDE4    ‐    ‐    ‐    ‐    ‐    ‐    ‐    ‐    ‐    ‐    E163 D164 H165
L.majorPDEB1 T601 P602 E603 E604 R605 E606 A607 V608 M609 S610 I611 D612 F613

TbrPDEB1     L602 P603 S604 F605 ‐    ‐    D606 V607 T608 D609 V610 E611 F612
TbrPDEB2     N604 V605 D606 V607 ‐    ‐    T608 D609 I610 D611 ‐    ‐    F612
HumanPDE4    L166 A167 K168 E169 L170 E171 D172 L173 N174 K175 W176 G177 L178
L.majorPDEB1 G614 G615 A616 Y617 ‐    ‐    D618 F619 T620 S621 P622 G623 F624

TbrPDEB1     D613 L614 F615 R616 A617 ‐    ‐    R618 E619 S620 T621 D622 K623
TbrPDEB2     D613 L614 F615 Q616 A617 R618 E619 S620 T621 ‐    D622 ‐    K623
HumanPDE4    N179 I180 F181 N182 V183 A184 ‐    G185 Y186 ‐    S187 ‐    H188
L.majorPDEB1 N625 L626 F627 E628 V629 ‐    ‐    R630 E631 K632 Y633 S634 E635

TbrPDEB1     P624 L625 D626 V627 A628 A629 A630 I631 A632 Y633 R634 L635 L636
TbrPDEB2     P624 L625 D626 V627 A628 A629 A630 I631 A632 Y633 R634 L635 L636
HumanPDE4    N189 ‐    R190 P191 L192 T193 c194 I195 M196 Y197 A198 I199 F200
L.majorPDEB1 P636 M637 D638 A639 A640 A641 G642 V643 V644 Y645 N646 L647 L648

TbrPDEB1     L637 G638 S639 G640 L641 P642 Q643 K644 F645 G646 C647 S648 D649
TbrPDEB2     L637 G638 S639 G640 L641 P642 Q643 K644 F645 G646 C647 S648 D649
HumanPDE4    Q201 E202 R203 D204 L205 L206 K207 T208 F209 R210 I211 S212 S213
L.majorPDEB1 W649 N650 S651 G652 L653 P654 E655 K656 F657 G658 C659 R660 E661

TbrPDEB1     E650 V651 L652 L653 N654 F655 I656 L657 Q658 C659 R660 K661 K662
TbrPDEB2     E650 V651 L652 L653 N654 F655 I656 L657 Q658 C659 R660 K661 K662
HumanPDE4    D214 T215 F216 I217 T218 Y219 M220 M221 T222 L223 E224 D225 H226
L.majorPDEB1 Q662 T663 L664 L665 N666 F667 I668 L669 Q670 C671 R672 R673 R674

TbrPDEB1     Y663 R664 N665 ‐    V666 P667 Y668 H669 N670 F671 Y672 H673 V674
TbrPDEB2     Y663 R664 N665 ‐    V666 P667 Y668 H669 N670 F671 Y672 H673 V674
HumanPDE4    Y227 H228 S229 D230 V231 A232 Y233 H234 N235 S236 L237 H238 A239
L.majorPDEB1 Y675 R676 R677 ‐    V678 P679 Y680 H681 N682 F683 Y684 H685 V686


TbrPDEB1     V675 D676 V677 C678 Q679 T680 I681 H682 T683 F684 L685 Y686 R687
TbrPDEB2     V675 D676 V677 C678 Q679 T680 I681 Y682 T683 F684 L685 Y686 R687
HumanPDE4    A240 D241 V242 A243 Q244 S245 T246 H247 V248 L249 L250 S251 T252
L.majorPDEB1 V687 D688 V689 C690 Q691 T692 L693 H694 T695 Y696 L697 Y698 T699

TbrPDEB1     G688 N689 V690 Y691 E692 K693 L694 T695 E696 L697 E698 C699 F700
TbrPDEB2     G688 N689 V690 Y691 E692 K693 L694 T695 E696 L697 E698 C699 F700
HumanPDE4    P253 A254 L255 D256 A257 V258 F259 T260 D261 L262 E263 I264 L265
L.majorPDEB1 G700 K701 A702 S703 E704 L705 L706 T707 E708 L709 E710 C711 Y712

TbrPDEB1     V701 L702 L703 I704 T705 A706 L707 V708 H709 D710 L711 D712 H713
TbrPDEB2     V701 L702 L703 I704 T705 A706 L707 V708 H709 D710 L711 D712 H713
HumanPDE4    A266 A267 I268 F269 A270 A271 A272 I273 H274 D275 V276 D277 H278
L.majorPDEB1 V713 L714 L715 V716 T717 A718 L719 V720 H721 D722 L723 D724 H725

*
TbrPDEB1     M714 G715 L716 N717 N718 S719 F720 Y721 L722 K723 T724 E725 S726
TbrPDEB2     M714 G715 L716 N717 N718 S719 F720 Y721 L722 K723 T724 E725 S726
HumanPDE4    P279 G280 V281 S282 N283 Q284 F285 L286 I287 N288 T289 N290 S291
L.majorPDEB1 M726 G727 V728 N729 N730 S731 F732 Y733 L734 K735 T736 D737 S738

TbrPDEB1     P727 L728 G729 I730 L731 S732 S733 A734 S735 G736 N737 T738 S739
TbrPDEB2     P727 L728 G729 I730 L731 S732 S733 A734 S735 G736 N737 K738 S739
HumanPDE4    E292 L293 A294 L295 M296 Y297 N298 D299 ‐    ‐    ‐    E300 S301
L.majorPDEB1 P739 L740 G741 I742 L743 S744 S745 A746 S747 G748 N749 N750 S751

TbrPDEB1     V740 L741 E742 V743 H744 H745 C746 N747 L748 A749 V750 E751 I752
TbrPDEB2     V740 L741 E742 V743 H744 H745 C746 N747 L748 A749 V750 E751 I752
HumanPDE4    V302 L303 E304 N305 H306 H307 L308 A309 V310 G311 F312 K313 L314
L.majorPDEB1 V752 L753 E754 V755 H756 H757 C758 S759 L760 A761 I762 E763 I764

TbrPDEB1     L753 S754 D755 P756 E757 S758 D759 V760 F761 D762 G763 L764 E765
TbrPDEB2     L753 S754 D755 P756 E757 S758 D759 V760 F761 G762 G763 L764 E765
HumanPDE4    L315 Q316 E317 E318 H319 C320 D321 I322 F323 M324 N325 L326 T327
L.majorPDEB1 L765 S766 D767 P768 A769 A770 D771 V772 F773 E774 G775 L776 S777

TbrPDEB1     G766 A767 E768 R769 T770 L771 A772 F773 R774 S775 M776 I777 D778
TbrPDEB2     G766 A767 E768 R769 T770 L771 A772 F773 R774 S775 M776 I777 D778
HumanPDE4    K328 K329 Q330 R331 Q332 T333 L334 R335 K336 M337 V338 I339 D340
L.majorPDEB1 G778 Q779 D780 V781 A782 Y783 A784 Y785 R786 A787 L788 I789 D790


TbrPDEB1     C779 V780 L781 A782 T783 D784 M785 A786 K787 H788 G789 S790 A791
TbrPDEB2     C779 V780 L781 A782 T783 D784 M785 A786 R787 H788 S789 E790 F791
HumanPDE4    M341 V342 L343 A344 T345 D346 M347 S348 K349 H350 M351 S352 L353
L.majorPDEB1 C791 V792 L793 A794 T795 D796 M797 A798 K799 H800 A801 D802 A803

TbrPDEB1     L792 E793 A794 F795 L796 A797 S798 A799 ‐    A800 ‐    ‐    D801
TbrPDEB2     L792 E793 K794 Y795 L796 E797 L798 M799 K800 T801 S802 Y803 N804
HumanPDE4    L354 A355 D356 L357 K358 T359 M360 V361 ‐    E362 ‐    ‐    T363
L.majorPDEB1 L804 S805 R806 F807 T808 E809 L810 A811 ‐    T812 ‐    ‐    S813

TbrPDEB1     Q802 ‐    ‐    ‐    ‐    ‐    ‐    ‐    ‐    ‐    S803 S804 ‐
TbrPDEB2     V805 ‐    ‐    ‐    ‐    ‐    ‐    ‐    ‐    ‐    D806 ‐    ‐
HumanPDE4    K364 K365 V366 T367 S368 S369 G370 V371 L372 L373 L374 D375 ‐
L.majorPDEB1 G814 ‐    ‐    ‐    ‐    ‐    ‐    ‐    ‐    ‐    F815 E816 K817

TbrPDEB1     D805 E806 A807 A808 F809 H810 R811 M812 T813 M814 E815 I816 I817
TbrPDEB2     ‐    D807 S808 D809 H810 R811 Q812 M813 T814 M815 D816 V817 L818
HumanPDE4    ‐    N376 Y377 T378 D379 R380 I381 Q382 V383 L384 R385 N386 M387
L.majorPDEB1 D818 N819 D820 T821 H822 R823 R824 L825 V826 M827 E828 T829 L830

*
TbrPDEB1     L818 K819 A820 G821 D822 I823 S824 N825 V826 T827 K828 P829 F830
TbrPDEB2     M819 K820 A821 G822 D823 I824 S825 N826 V827 T828 K829 P830 F831
HumanPDE4    V388 H389 C390 A391 D392 L393 S394 N395 P396 T397 K398 S399 L400
L.majorPDEB1 I831 K832 A833 G834 D835 V836 S837 N838 V839 T840 K841 P842 F843

***
TbrPDEB1     D831 I832 S833 R834 Q835 W836 A837 M838 A839 V840 T841 E842 E843
TbrPDEB2     D832 I833 S834 R835 Q836 W837 A838 M839 A840 V841 T842 E843 E844
HumanPDE4    E401 L402 Y403 R404 Q405 W406 T407 D408 R409 I410 M411 E412 E413
L.majorPDEB1 E844 T845 S846 R847 M848 W849 A850 M851 A852 V853 T854 E855 E856

TbrPDEB1     F844 Y845 R846 Q847 G848 D849 M850 E851 K852 E853 R854 G855 V856
TbrPDEB2     F845 Y846 R847 Q848 G849 D850 M851 E852 K853 E854 R855 G856 V857
HumanPDE4    F414 F415 Q416 Q417 G418 D419 K420 E421 R422 E423 R424 G425 M426
L.majorPDEB1 F857 Y858 R859 Q860 G861 D862 M863 E864 K865 E866 K867 G868 V869

*
TbrPDEB1     E857 ‐    ‐    V858 L859 P860 M861 F862 D863 R864 S865 K866 N867
TbrPDEB2     E858 V859 L860 P861 M862 ‐    ‐    F863 D864 R865 S866 K867 N868
HumanPDE4    E427 ‐    ‐    I428 S429 P430 M431 C432 D433 K434 H435 T436 ‐
L.majorPDEB1 E870 ‐    ‐    V871 L872 P873 M874 F875 D876 R877 S878 K879 N880


ZHOUXI (JOSIE) WANG 331 Huntington Ave. Apt 33, Boston, MA 02115 617-939-7527 [email protected]

SUMMARY

• Over 2 years of experience in protein comparative modeling, docking, and molecular dynamics simulations
• Strong skills in establishing various model types (homology, pharmacophore, and QSAR) to prioritize synthesis, derivatization, or high-throughput screening of drug candidates
• Experience with molecular modeling software (YASARA, Schrödinger, Discovery Studio, OpenEye Scientific Software, PyMOL, and Pipeline Pilot), bioinformatics software (including SEQUEST and MASCOT), and proteomics/molecular biology databases
• Solid background in bio-analysis and medicinal chemistry
• Proven track record in both computational and experimental research

EDUCATION

Ph. D. in Chemistry, Expected June 2012
Department of Chemistry, Northeastern University, Boston, MA, USA
Advisor: Mary Jo Ondrechen, Ph. D., Professor in Chemistry and Chemical Biology
Advisor: Michael P. Pollastri, Ph. D., Associate Professor in Chemistry and Chemical Biology
GPA: 3.73

M. E. in Material Science, June 2006
College of Material Science and Engineering, Donghua University, Shanghai, China
Advisor: Jiongxin Zhao, Ph. D., Professor in Material Science and Chemistry, Chair of Chemistry College
Ranking in graduating class: 2nd out of 34

B. E. in Polymer Science, June 2003
College of Material Science and Engineering, Donghua University, Shanghai, China
Advisor: Jiongxin Zhao, Ph. D., Professor in Material Science
Ranking in graduating class: 4th out of 99

PROFESSIONAL RESEARCH EXPERIENCE

Research Assistant, Chemistry and Chemical Biology Department, Northeastern University, Boston, MA
• Molecular modeling for computational guidance of drug design for neglected tropical diseases (project sponsored by NIH), Jan. 2010-Present
  → Utilize computational structure-based and ligand-based methods for drug design
  → Build homology models and perform docking and molecular dynamics simulations for Aurora kinases and phosphodiesterases as targets in trypanosomal parasites
  → Apply structural biology information and molecular modeling techniques to establish structure-activity relationships between proteins and inhibitors

• Predicting biochemical function of Structural Genomics proteins of unknown function (project sponsored by NSF), Jan. 2010-Present
  → Utilize computational chemistry tools to predict active site residues in protein structures
  → Develop and implement new molecular modeling methods for protein biochemical function analysis
  → Predict function of superfamily members of previously unknown function

Research Assistant, Barnett Institute of Chemical and Biological Analysis, Northeastern University, Boston, MA
• Micro-system for biomarker detection (project sponsored by Keck Foundation), June 2008-Dec. 2009
  → Developed and optimized protocol for immobilization of antibodies to nanoparticles
  → Validated the quality of antibody-coated nanoparticles using light scattering and enzyme-linked immunosorbent assay (ELISA)
  → Prepared and purified fluorescently labeled antibodies for assay quantitation
  → Utilized nanoparticle-assembled microchips to develop and examine immunoassay protocols
  → Recorded and communicated scientific findings with internal and external collaborators

• Cysteine desulfurization chemistry (sponsored by NIH), June 2008-Dec. 2009
  → Characterized peptide modifications using multi-enzymatic digestion
  → Labeled peptides with deuterium to test hypothesis
  → Quantified sulfhydryl groups by Ellman’s reagent
  → Analyzed peptides quantitatively by mass spectrometry (MS)

• Transcriptional activator peptide (TAT) projects, Mar. 2008-Dec. 2008
  → Characterized TAT peptides, polymer-conjugated TAT, and phospholipid-conjugated polymer-TAT by MS
  → Prepared data analysis report and investigated the proteolytic kinetics

• Protein Forest digital ProteomeChip™ (Protein Forest Inc., MA) beta-testing (project sponsored by Protein Forest Inc.), June 2007-Dec. 2007
  → Optimized protocol of ProteomeChip™, a gel-based electrophoresis technique (isoelectric focusing) combined with MS
  → Verified the efficiency and reproducibility of the methodology
  → Assisted method development and conducted experiments using ProteomeChip™ for shotgun proteomics applications

Graduate Researcher, State Key Laboratory for Modification of Chemical Fibers and Polymer Materials, College of Material Science and Engineering, Donghua University, Shanghai, China
• Core-Shell biocompatible polymeric nanoparticles, Sep. 2003-Mar. 2006
  → Developed two protocols for polymeric nanoparticle fabrication
  → Synthesized Core-Shell biocompatible polymeric nanoparticles
  → Characterized nanoparticles by light scattering and scanning electron microscopy
  → Published two research papers based on the project

Undergraduate Researcher, Key Laboratory for Macromolecular Engineering of Polymer in Ministry of Education, Fudan University, Shanghai, China
• Core-Shell polymeric micelles self-assembly, Feb. 2003-July 2003
  → Prepared and constructed non-covalently connected polymer micelles

PUBLICATIONS

• Zhouxi Wang, Mary Jo Ondrechen, Structurally Aligned Local Sites of Activity (SALSAs) for Functional Characterization of Enzyme Structures in the Concanavalin A-like Lectins/Glucanases Superfamily (manuscript in preparation) 2012

• Zhouxi Wang, Pengcheng Yin, Joslynn S. Lee, Ramya Parasuram, Srinivas Somarowthu, and Mary Jo Ondrechen, Protein Function Annotation with Structurally Aligned Local Sites of Activity (SALSAs) (submitted)

• Stefan Ochiana*, Vidya Pandarinath*, Zhouxi Wang*, Rishika Kapoor, Mary Jo Ondrechen, Larry Ruben, Michael P. Pollastri, The human Aurora kinase inhibitor danusertib is a lead compound for anti-trypanosomal drug discovery via target repurposing (submitted, * contributed equally)

• Nicholas D. Bland, Cuihua Wang, Craig Tallman, Alden E. Gustafson, Zhouxi Wang, Trent D. Ashton, Stefan O. Ochiana, Gregory McAllister, Kristina Cotter, Anna P. Fang, Lara Gechijian, Norman Garceau, Ranjiv Gangurde, Ron Ortenberg, Mary Jo Ondrechen, Robert K. Campbell, and Michael P. Pollastri, Pharmacological Validation of Trypanosoma brucei Phosphodiesterases B1 and B2 as Druggable Targets for African Sleeping Sickness, Journal of Medicinal Chemistry 2011, 54: 8188-8194

• Zhouxi Wang, Tomas Rejtar, Zhaohui Sunny Zhou, Barry L. Karger, Desulfurization of Cysteine-Containing Peptides Resulting from Sample Preparation for Protein Characterization by MS, Rapid Communications in Mass Spectrometry 2010, 24(3): 267-275

• Jacob Grunwald, Tomas Rejtar, Rupa Sawant, Zhouxi Wang, Vladimir P. Torchilin, TAT Peptide and Its Conjugates: Proteolytic Stability, Bioconjugate Chemistry 2009, 20(8): 1531–1537

• Youwei Zhang, Zhouxi Wang, Yansong Wang, Jiongxin Zhao and Chengxun Wu, Facile preparation of pH-responsive gelatin-based core–shell polymeric nanoparticles at high concentrations via template polymerization, Polymer 2007, 48(19): 5639-5645

• Zhouxi Wang, Youwei Zhang, Yansong Wang, Jiongxin Zhao, Chengxun Wu, A Way to Prepare Core-Shell Biocompatible Polymeric Nano-particles from Gelatin and Acrylic acid, Journal of Macromolecular Science 2006, 43(11): 1779-1786

• Youwei Zhang, Ming Jiang, Jiongxin Zhao, Zhouxi Wang, Hongjin Dou, Daoyun Chen, pH Responsive Core-Shell Particles and Hollow Sphere Attained by Macromolecular Self-Assembly, Langmuir 2005, 21(4): 1531-1538

PRESENTATIONS

• 243rd American Chemical Society National Meeting & Exposition, March 2012, Anaheim, CA
  Poster: Structurally Aligned Local Sites of Activity (SALSAs) for functional characterization of enzyme structures in the Concanavalin A-like lectins/glucanases superfamily
  Zhouxi Wang and Mary Jo Ondrechen
  Poster: Identification of small-molecule inhibitors of the trypanosomal kinase TbAUK1 based on comparative modeling
  Zhouxi Wang, Stefan O. Ochiana, Vidya Pandarinath, Larry Ruben, Mary Jo Ondrechen, and Michael P. Pollastri
  Oral: Structurally aligned local sites of activity (SALSA): Better functional annotation through chemistry
  Mary Jo Ondrechen, Zhouxi Wang, Pengcheng Yin, Joslynn S. Lee, Ramya Parasuram

• 25th Anniversary Symposium of The Protein Society, July 2011, Boston, MA
  Poster: Structurally Aligned Local Sites of Activity (SALSAs) for Functional Characterization of Protein Structures in the Concanavalin A-like Lectins/Glucanases Superfamily
  Zhouxi Wang, Srinivas Somarowthu, and Mary Jo Ondrechen

• Automated Function Prediction/CAFA 2011, a satellite meeting of the International Society for Computational Biology, July 2011, Vienna, Austria
  Poster: Protein Function Annotation with Structurally Aligned Local Sites of Activity (SALSAs)
  Srinivas Somarowthu, Joslynn S. Lee, Zhouxi Wang, Ramya Parasuram, Pengcheng Yin, and Mary Jo Ondrechen

• Research and Technology Scholarship Expo, April 2011, Boston, MA
  Poster: Homology modeling and docking of TbAUK1: Computational guidance for the development of a T. brucei therapeutic
  Zhouxi Wang, Stefan O. Ochiana, Vidya Pandarinath, Larry Ruben, Mary Jo Ondrechen, and Michael P. Pollastri
  Poster: Design and synthesis of trypanosomal Aurora kinase inhibitors based on the 1,4,5,6-tetrahydropyrrolo[3,4-c]pyrazole scaffold for treatments for African sleeping sickness
  Stefan O. Ochiana, Vidya Pandarinath, Zhouxi Wang, Mary Jo Ondrechen, Larry Ruben, and Michael P. Pollastri

• 241st American Chemical Society National Meeting & Exposition, March 2011, Anaheim, CA
  Poster: Structurally Aligned Local Sites of Activity (SALSAs): Application of computational chemistry to functional annotation of structural genomics proteins
  Srinivas Somarowthu, Joslynn S. Lee, Pengcheng Yin, Ramya Parasuram, Zhouxi Wang, and Mary Jo Ondrechen

• 240th American Chemical Society National Meeting & Exposition, August 2010, Boston, MA
  Poster: Comparative modeling for Trypanosoma TOR Kinase Domain: the implantation of T. brucei drug design
  Zhouxi Wang, Caitlin E. Hubbard, Mary Jo Ondrechen, and Michael P. Pollastri
  Poster: Inhibitors of trypanosomal Aurora kinases as an approach for treatments for African sleeping sickness
  Stefan O. Ochiana, Vidya Pandarinath, Zhouxi Wang, Mary Jo Ondrechen, Larry Ruben, and Michael P. Pollastri

• ISMB of the International Society for Computational Biology, June 2010, Boston, MA
  Poster: Comparative modeling for Trypanosoma TOR Kinase Domain: the implantation of T. brucei drug design
  Zhouxi Wang, Mary Jo Ondrechen, and Michael P. Pollastri

• Research and Technology Scholarship Expo, April 2010, Boston, MA
  Poster: Comparative Model Structures for Trypanosoma TOR Kinase Domain
  Zhouxi Wang, Mary Jo Ondrechen, and Michael P. Pollastri

• American Society for Mass Spectrometry, June 2009, Philadelphia, PA
  Poster: Sample Preparation Induced Modifications of Cysteine
  Zhouxi Wang, Tomas Rejtar, Zhaohui Sunny Zhou, and Barry L. Karger

• International Conference on Advanced Fiber and Polymer Materials, October 2005, Shanghai, China
  Oral: A New Way to Prepare Core-Shell Biocompatible Polymeric Nano-particles from Gelatin and Acrylic acid
  Zhouxi Wang, Youwei Zhang, Jiongxin Zhao

TEACHING EXPERIENCE

Northeastern University, Boston, MA
• Mentor to undergraduate and Masters student research interns, May 2010-May 2011
  → Taught, assisted, and supervised undergraduate and graduate student researchers in the areas of structure-based techniques of drug discovery and molecular modeling
• Teaching assistant in Chemistry, Sep. 2006-May 2008 & Jan. 2010-Dec. 2010
  → Taught undergraduate Physical Chemistry and General Chemistry laboratory courses for groups of up to 30
  → Graded homework and lab reports on a weekly basis
  → Served as a tutor for undergraduate students

Donghua University, Shanghai, China
• Mentor to undergraduate students, Jan. 2005-Mar. 2006
  → Taught and assisted undergraduate interns in basic polymerization experiments and lab report writing

RELEVANT SKILLS

• Molecular Modeling: homology modeling, docking, pharmacophore modeling, MD simulations, and electrostatics calculations
• Operating Systems: Windows, Linux, and Mac
• Programming Languages: C++ and Python
• Protein Chemistry and Biotechnology: protein enzymatic digestion, peptide mapping, electrophoresis, ELISA, protein assays, liquid chromatography (LC), and mass spectrometry (MS) including MALDI-TOF/TOF-MS (ABI), LCQ/LTQ (Thermo), and qTof (Agilent)

INTERNSHIPS

• Management Assistant, GUJIN Co., Ltd. (Beijing, China), July 2005-Aug. 2005
• Assistant Engineer, DuPont China (Suzhou, Jiangsu, China), July 2002-Aug. 2002

PROFESSIONAL SOCIETIES, LEADERSHIP, COMMUNITY INVOLVEMENT

• Member of American Chemical Society, Protein Society, and Association for Women in Science, 2009-Present
• Volunteer for Applied Pharmaceutical Chemistry Conference, Boston, 2011
• Volunteer for Chemistry Department open house, Northeastern University, Boston, 2007-2011
• Volunteer for MicroScale Bioseparation Conference, Boston, 2009
• Student Leader of Students' Union, Graduate College, Donghua University, 2003-2006
• Volunteer for nursing the elderly in TianSan Nursing house, Shanghai, 2003-2005
• Head of Science Department in Students’ Union, College of Material Science, Donghua University, 2001-2002

AWARDS AND HONORS

• Sangma scholarship, Hong Kong, China, 2001
• People's scholarship, Donghua University, China, 2001-2002
• Caiyan scholarship, Donghua University, China, 2000
• Scholarship for Academic Excellence, 2000

REFERENCES

Mary Jo Ondrechen, Ph. D., Professor in Chemistry and Chemical Biology, Northeastern University, Tel: 617-373-2856, [email protected]
Michael P. Pollastri, Ph. D., Associate Professor in Chemistry and Chemical Biology, Northeastern University, Tel: 617-373-2703, [email protected]
Zhaohui Sunny Zhou, Ph. D., Professor in Chemistry and Chemical Biology, Northeastern University, Tel: 617-373-4818, [email protected]
