Applications of molecular modeling techniques in the design of xanthine based adenosine receptor antagonists and the development of the protein function annotation method SALSA

by Joslynn S. Lee

B.S. in Chemistry-Biochemistry and Cellular Molecular Biology, Fort Lewis College

A dissertation submitted to:

The Faculty of the College of Science of Northeastern University

in partial fulfillment of the requirements for the degree of Doctor of Philosophy

February 21, 2014

Dissertation directed by

Mary Jo Ondrechen

Professor of Chemistry and Chemical Biology

Graham B. Jones

Professor of Chemistry and Chemical Biology

© 2014

Joslynn S. Lee

ALL RIGHTS RESERVED

Dedication

I dedicate this dissertation to my grandparents, Susie & William Pino and Tom & Alice Lee, who inspired me to get a higher education since they were not given the opportunity to go to college.

On the Navajo reservation, shinálí asdzáá (my maternal grandmother) would take me to herd goats and pick wild tea (Thelesperma sp.). She would use this plant to help with stomach aches and digestion. This was my first memorable experience that piqued my interest in science.

Their stories and beliefs, passed down to me, have shaped the person I am today.

Ahéhee’ and Dawaeh (thank you in Navajo and Kerese)

Hózhóogo naasháa doo (In beauty I walk)

Shitsijí' hózhóogo naasháa doo (With beauty before me I walk)

Shikéédéé hózhóogo naasháa doo (With beauty behind me I walk)

Shideigi hózhóogo naasháa doo (With beauty above me I walk)

T'áá altso shinaagóó hózhóogo naasháa doo (With beauty around me I walk)

Hózhó náhásdlíí' (It has become beauty again)

Da wa `eh, he numeh, Joslynn Lee, shawiiti hanu stah `che, Naya eh, Na sge ya, ma mie, stru gu na ma, eh za, uum`atsi, du dra ne, ha ya, s`au du me`tra, no ya zi, eh sw ue no 'ta.

(Thank-you, from me, Joslynn Lee of the Parrot Clan of my people. To our Mother Earth and Father Creator, I'm thankful for the blessings and help bestowed upon me to complete my education.)

Acknowledgements

I want to thank my advisor, mentor and friend, Professor Mary Jo Ondrechen, for allowing me to join her lab to conduct research. Mary has always shown unconditional support and patience during my time at NU. I want to thank my co-advisor, mentor and friend, Professor Graham Jones, for building a collaboration for molecular modeling and research for the A2AAR project (Part I) and for his support during my time at NU. I want to extend my gratitude to my thesis committee members, Professor Carla Mattos and Professor Gene Cooperman, for their feedback and support. Their guidance has been invaluable.

I cannot give enough thanks to previous ORG members, Dr. Srinivas Somarowthru, Dr. Jaeju Ko, Dr. Leo Murga, Dr. Heather Brodkin, Dr. Pengcheng Yin and Dr. Zhouxi Wang, for their help, patience and discussions of my research. I thank the current ORG members, Dr. Ramya Parasuram, Caitlyn Mills and Lisa Ngu, for their time discussing my research and for their support!

My research would not be a great story without the collaborations for the A2AAR and SALSA projects. Rhiannon Thomas-Tran is an amazing individual who helped get the A2AAR project going, as did Vincent Chevalier with the scale-up of molecules for in vitro studies. The in vitro studies were performed in the lab of Professor Michail Sitkovsky by Dr. Stephen Hatfield and Dr. Kaisa Selesniemi. The SALSA-DT program could not have been written or optimized without the help of Rohan Garg (CCIS), Liang Tian (Math), Jiajun Cao (CCIS), Professor Gene Cooperman (CCIS) and Professor Alexandru Suciu (Math).

The wonderful staff in the Chemistry and Chemical Biology department, Jean Harris, Cara Shockley, Richard Pumphrey and Andrew Bean, were happy and helpful individuals. I want to thank my undergraduate professors who made an impact on my research career. At FLC, Dr. Leslie Sommerville, Dr. Ron Estler and Dr. Rob Milofsky were the first to encourage me to pursue research in chemistry. I thank Les for his unconditional support and continued mentoring. Dr. Harry Higgs from Dartmouth University let me conduct summer research in his lab and gave me a preview of graduate school level work, pairing me up with Dr. Susie Nicholson-Dykstra. She helped me build my confidence as a budding scientist!

Before I applied to grad school, Dr. Christopher Brummel from Vertex Pharmaceuticals assigned me challenging projects to tackle as a research associate but, most importantly, encouraged me to leave my position to pursue graduate level studies full-time. Dr. Daniel Jay, from Tufts Med School and the Academy for Future Faculty (AFF), provided his support, along with the amazing cohort of doctoral students, who all inspire me to pursue my goal of becoming a professor.

My family has been pivotal in my pursuit of a graduate degree and in my becoming the first Ph.D. My sister, Rhiannon Lee, provided her unconditional love and support during graduate school. I thank her wife, Eliza, for providing me with laughs and constant support, and especially a place to stay while writing my dissertation. I thank my parents, Thomas & Valeria Lee, for everything. They believed in me and instilled the cultural beliefs that have shaped the woman I am today, and they continuously supported me in furthering my education. I thank my brother Clifton, his wife KC and their kids for being my cheerleading squad filled with unconditional love!

My projects were financially supported by the National Science Foundation under grants MCB-0843603/1158176 and CHE-1305655. Receiving the prestigious NSF graduate research fellowship allowed me to share my results internationally and provided necessary instrumentation in the lab. Other funding sources include Dana Farber Institute and IGERT-NU.

Abstract of Dissertation

In the field of molecular modeling, theoretical and computational methods are used to study biological structures, dynamics and interactions. Using functional site predictors to identify the binding sites in 3D protein structures provides information that can be used in the fields of drug design and protein function annotation. This dissertation is divided into two parts: the first describes the application of molecular modeling techniques in the development of antitumor immunotherapies that target the human A2A adenosine receptor, and the second describes the development of an automated program, Structurally Aligned Local Sites of Activity (SALSA), that assigns function based on local sites of alignment and chemical properties.

Cancer cells undergo tumor hypoxia in the earlier stages of growth. Hypoxic tumors generally have poor prognosis and can become resistant to traditional radiation and chemotherapy treatments. In the tumor microenvironment, an overaccumulation of adenosine is produced through the hypoxia-adenosinergic tissue-protecting mechanism that activates A2A adenosine receptors (A2AAR) on the surface of surrounding cells, leading to the protection of cancerous tissues. Targeting the A2AAR with antagonists will disrupt A2AAR signaling, thus preventing the protection of tumor cells. Molecular modeling techniques were applied to the design and optimization of xanthine-based A2AAR antagonists. In silico docking studies, using the programs AutoDock 4.0 and Schrödinger’s GLIDE, yielded similar and favorable binding poses that were used to identify a lead compound. The lead compound was converted to a PEG derivative and tested in two in vitro functional binding assays to confirm efficacy. The lead compound performed better than a previously known antagonist.

In 2000, the Protein Structure Initiative (PSI) was launched to make the 3D structures of proteins easily attainable from the knowledge of their corresponding DNA sequences. The PSI project generated a large set of 3D structures named structural genomics proteins. To date, there are 12,900 structural genomics (SG) protein structures reported in the Protein Data Bank. These SG proteins were annotated (assigned function) using sequence or structural similarity matching tools. Unfortunately, a number of these SG proteins are listed with unknown, putative or hypothetical function. The computational method Structurally Aligned Local Sites of Activity (SALSA) is able to reliably assign function to these SG proteins. The strategy of SALSA is to define functional subclasses of characterized members from a protein superfamily. Functional site predictors identify active site residues to create unique signatures for each subclass. Incorporating multiple structure alignment tools, a local spatial pattern of the active site residues is established for each functional subclass, and this can be used to sort a superfamily according to biochemical function. An automated version of SALSA, SALSA-DT, utilizes Delaunay triangulation to better match the local active sites of 3D protein structures for faster structural alignments within the superfamily. The application of the SALSA and SALSA-DT methods to the ribulose phosphate binding barrel (RPBB) superfamily successfully sorts the superfamily, and 27 SG proteins are evaluated. Additionally, SALSA revealed more information about the structural architecture of the active sites within the superfamily.

Table of Contents

Dedication

Acknowledgements

Abstract of Dissertation

Table of Contents

List of Figures

List of Tables

List of Abbreviations

Part I – Using molecular modeling techniques in designing small molecules to target the human A2A adenosine receptor as anti-tumor immunotherapies

Chapter 1: Introduction to molecular modeling in drug discovery

1.1 Overview of current molecular modeling tools in drug discovery

1.2 Computationally guided ligand design

1.3 Introduction to molecular modeling tools

1.3.1 Comparative modeling/Homology modeling

1.3.2 Homology modeling methodology in YASARA

1.3.3 Active site prediction

1.3.4 Molecular docking

1.3.4.1 Docking methods

1.3.4.2 Sampling function

1.3.4.3 Scoring functions

1.4 General protocol for docking

1.4.1 Sampling and scoring functions used in the ORG

1.4.2 YASARA Energy Minimization

1.4.3 PRODRG

1.4.4 YASARA implemented in AutoDock 4.0

1.4.5 Schrödinger’s GLIDE

1.5 Evaluation of docking results

1.6 Summary

Chapter 2: Designing small molecules to target the human A2A adenosine receptor

2.1 Introduction of guanine nucleotide-binding protein (G-protein) coupled receptors (GPCR)

2.2 Types of adenosine receptors

2.3 Implications of adenosine A2A receptors in cancer

2.4 Known small molecule therapies targeting the A2A receptor

2.5 Proposal to design a small molecule to target the A2A receptor

2.6 Understanding the structure and function of the human A2A receptor

2.7 Using molecular modeling tools for docking validation

2.7.1 POOL predictions give insight for previously unidentified binding residues

2.7.2 Initial docking validation using YASARA AutoDock

2.7.3 Initial invalid docking results in YASARA AutoDock

2.7.4 Optimization of structures: Using homology model to build in missing ECL2 loop

2.7.5 Redocking of co-crystallized ligands into the homology models for AutoDock 4.0 docking validation

2.7.6 Perform docking in GLIDE as a cross-docking validation for the homology model 3eml

2.8 Applying YASARA AutoDock and GLIDE docking to xanthine-based molecules

2.8.1 Understanding interactions of KW-6002 to A2A adenosine receptor

2.8.2 Applying YASARA AutoDock and GLIDE docking to methyl ester arylxanthines

2.8.3 Applying YASARA AutoDock and GLIDE docking to methyl ester styrylxanthines

2.8.4 Results of best binders for PEG derivatives

2.9 Proposed synthesis of xanthine-based molecules

2.10 Testing of synthesized molecules with in vitro functional assays

2.10.1 Measuring functionality of A2AAR antagonism by cAMP assay

2.10.2 Using the cytokine release assay of PEG-para reveals stronger A2AAR antagonism than KW-6002

2.11 Summary

Part II – Developing the automated protein function method: Structurally Aligned Local Sites of Activity (SALSA)

Chapter 3: Introduction of protein function annotation method: Structurally Aligned Local Sites of Activity (SALSA)

3.1 Introduction of the Protein Structure Initiative (PSI) and structural genomics (SG)

3.2 Current computational methods to determine function of uncharacterized proteins

3.2.1 Sequence-based methods

3.2.2 Structure-based methods

3.3 Background of Structurally Aligned Local Sites of Activity (SALSA)

3.4 Procedure to run the SALSA analysis

3.4.1 Organizing the superfamily

3.4.2 Homology modeling

3.4.3 Incorporating the functional site predictor, POOL, in SALSA

3.4.4 Multiple structure alignments (MSA) of subclasses

3.4.5 Molecular of protein structures and sequences

3.4.6 Generating a SALSA table

3.4.7 Generating consensus signatures (CS)

3.4.8 Analysis of structural genomics (SG) proteins

3.4.9 Scoring the SALSA table with Normalized Match to Consensus Signatures

3.5 The development of the automated SALSA analysis: SALSA-DT

3.5.1 Automating the SALSA analysis

3.5.2 Background of graph theory

3.5.3 Application of Delaunay triangulation and graph representations to SALSA

3.5.4 Implementation of Delaunay triangulation

3.6 Summary

Chapter 4: Applying SALSA-DT to the Ribulose Phosphate Binding Barrel (RPBB) superfamily

4.1 Introduction of the Ribulose phosphate binding barrel superfamily

4.1.1 Understanding the classification of subclasses

4.1.2 Background of structure and function of enzymes in the RPBB superfamily

4.1.2.1 Biosynthesis of tryptophan

4.1.2.2 Biosynthesis of histidine

4.1.2.3 Essential metabolic pathway enzymes

4.1.3 Sequence identity between subclasses within the RPBB superfamily

4.1.4 Identifying the structural genomics (SG) proteins in the RPBB superfamily

4.2 Analysis of RPBB superfamily using SALSA

4.2.1 Criteria for the dataset

4.2.2 Building homology models for structures in the RPBB superfamily

4.2.3 Identifying consensus signatures in the RPBB superfamily

4.2.3.1 Phosphoribosyl anthranilate isomerase (PRAI)

4.2.3.2 Indole glycerol phosphate synthase (IGPS)

4.2.3.3 Tryptophan synthase (TrpA)

4.2.3.4 Phosphoribosylformimino-5-aminoimidazole carboxamide ribotide isomerase (HisA) and imidazoleglycerolphosphate synthase (HisF)

4.2.3.5 Ribulose phosphate epimerase (RPE)

4.2.3.6 Orotidine monophosphate decarboxylase (OMPDC)

4.2.3.7 Keto-gulonate-6-phosphate decarboxylase (KGPDC)

4.2.3.8 Hexulose phosphate synthase (HPS)

4.2.4 Using consensus signatures to analyze the RPBB superfamily

4.2.4.1 HisA vs. PRAI: Different substrates, similar reaction mechanism

4.2.4.2 OMPDC vs KGPDC vs HPS vs RPE: Conserved spatial architecture of the active site

4.2.5 Scoring the SALSA table

4.2.6 Annotating structural genomics (SG) proteins using SALSA

4.3 Preliminary results in the analysis of RPBB superfamily using SALSA-DT

4.3.1 Results sorting the subclasses

4.3.2 Annotating structural genomics proteins using SALSA-DT

4.4 Summary

Chapter 5: Concluding remarks and future directions

5.1 Concluding remarks

5.2 Future Directions

Curriculum vitae

List of Figures

Chapter 1: Introduction to molecular modeling in drug discovery

Figure 1.1 Overview of timeline to obtain New Drug Application (NDA) approval

Figure 1.2 Preclinical steps for drug design

Figure 1.3 Overview of homology modeling methodology for application in drug discovery

Figure 1.4 Safe homology modeling zone for sequence identity

Figure 1.5 Example of functional active site programs

Figure 1.6 Example of THEMATICS plot

Figure 1.7 POOL methodology

Figure 1.8 Representation of the docking of inhibitor [I] in enzyme [E] resulting in complex [EI]

Figure 1.9 Classification for protein-ligand docking

Figure 1.10 Filtering system of GLIDE scoring

Chapter 2: Designing small molecules to target the human A2A receptor for hypoxia immunotherapies

Figure 2.1 Cartoon sketch of GPCR activation to GTP and GDP with a bound agonist

Figure 2.2 Affinity of adenosine to adenosine receptors

Figure 2.3 Overview of delayed negative feedback downregulation in inflamed local tissue environment

Figure 2.4 Activation or deactivation of A2A signaling by adenosine, A2A antagonists or genetic deletion of A2A receptors

Figure 2.5 Comparison of healthy tissue versus tumor tissue

Figure 2.6 The natural agonist adenosine and the non-selective antagonist caffeine

Figure 2.7 Structures of known agonists targeting the A2AAR receptor

Figure 2.8 Structures of known antagonists targeting the A2AAR receptor

Figure 2.9 Proposed design of bioconjugate

Figure 2.10 Cartoon view of the PDB 3eml crystal structure

Figure 2.11 Close-up view of active site residues for PDB 3eml

Figure 2.12 Structurally aligned active site of PDB 3eml, 2yz0 and 2yzv

Figure 2.13 Superposition of the crystal structures of the human A2AARs, 3eml co-crystallized with ZM241385 and 3qak co-crystallized with UK-432097

Figure 2.14 The docked ligand ZM241385 in the 3eml crystal structure has an invalid pose

Figure 2.15 Docked UK-432097 compound in the 3qak crystal structure

Figure 2.16 Ramachandran plot of homology models for 3eml and 3qak with missing loops built in

Figure 2.17 Docking validation of the 3eml homology model in YASARA AutoDock 4.0

Figure 2.18 Comparing the RMSD of atoms between the docked ligand in the crystal structure and the homology model 3eml

Figure 2.19 Docking validation of the 3qak homology model in YASARA’s AutoDock

Figure 2.20 Comparing the RMSD of atoms between the docked ligand in the crystal structure and homology model of 3qak

Figure 2.21 Comparison of AutoDock and GLIDE docking results of ZM241385

Figure 2.22 The antagonist, KW-6002, docked into the 3eml homology model

Figure 2.23 Molecules of the ortho-, meta-, and para- methyl ester arylxanthines

Figure 2.24 Docking results of the ortho-, meta-, and para- methyl ester arylxanthines

Figure 2.25 Molecules of the ortho-, meta-, and para- methyl ester styrylxanthines

Figure 2.26 Docking results of ortho-, meta-, and para- methyl ester styrylxanthines

Figure 2.27 The styrylxanthine PEG-analog selected for additional docking studies

Figure 2.28 Models of PEG-para and PEG-meta styrylxanthine derivatives

Figure 2.29 Initial synthesis of diaminouracil

Figure 2.30 Synthesis of styrylxanthine-PEG conjugate

Figure 2.31 cAMP functional assay to measure the functional effects of A2AAR signaling

Figure 2.32 Cytokine secretion assay to measure the functional effects of A2AAR signaling

Figure 2.33 The lead compounds, PEG-para-8-mer

Chapter 3: Introduction of protein function annotation method: Structurally Aligned Local Sites of Activity (SALSA)

Figure 3.1 Methods used to identify the function of a protein using sequence or structural information

Figure 3.2 Overview of the SALSA methodology

Figure 3.3 Overview of the process for creating a MSA of the superfamily

Figure 3.4 Steps for the automated program Generating Local Alignment Table (GLAT)

Figure 3.5 A hypothetical example of a generated SALSA table

Figure 3.6 A hypothetical example of consensus signatures in a SALSA table

Figure 3.7 An example table of consensus signatures

Figure 3.8 PAM250 Matrix

Figure 3.9 BLOSUM62 Matrix

Figure 3.10 Chemical similarity matrix prepared by the ORG

Figure 3.11 Overview of Generating Scoring Matrix (GSM) program

Figure 3.12 Line drawings of Graph A

Figure 3.13 Simplified example of Delaunay Triangulation for a 2D protein

Figure 3.14 Overview of SALSA-DT for matching pairs of proteins

Figure 3.15 An example of Delaunay triangulation for protein structures in the preprocessing step

Figure 3.16 The overview of pairwise matching, described in detail

Figure 3.17 An example of seed pair matching in 2D

Figure 3.18 Example of seed pair matching by chemical similarity

Figure 3.19 The lengths of the edges are included in the seed pair matching

Chapter 4: Applying SALSA and SALSA-DT to the Ribulose Phosphate Binding Barrel (RPBB) superfamily

Figure 4.1 Cartoon sketch of the overall shape of members from the RPBB superfamily

Figure 4.2 Simple overview of the various reactions found in the RPBB superfamily

Figure 4.3 Overall structure similarity found between the IGPS, PRAI and TrpA subclasses

Figure 4.4 The general mechanism for phosphoribosyl anthranilate isomerase (PRAI)

Figure 4.5 The general mechanism for indole glycerol phosphate synthase (IGPS)

Figure 4.6 The general mechanism for tryptophan synthase (TrpA)

Figure 4.7 Overall structure difference found between HisF in yeast and bacteria

Figure 4.8 Part I of the general proposed mechanism for HisA/HisF

Figure 4.9 Part II of the general proposed mechanism for HisA/HisF

Figure 4.10 The general mechanism for ribulose-phosphate epimerase (RPE)

Figure 4.11 The general mechanism for orotidine-5’-monophosphate decarboxylase (OMPDC)

Figure 4.12 The general mechanism for keto-3-gulonate-6-phosphate decarboxylase (KGPDC)

Figure 4.13 The general mechanism for hexulose phosphate synthase (HPS)

Figure 4.14 An example of a structural genomics protein page found in the PDB

Figure 4.15 The active site of PRAI with bound ligand CdRP in light purple

Figure 4.16 The active site of IGPS with bound ligand IGP in green

Figure 4.17 The active site of TrpA with bound ligand 1GP in gray

Figure 4.18 The active site of HisA/HisF with bound ligand PRFAR in white

Figure 4.19 The active site of RPE with bound ligand D-xylitol 5-phosphate in white

Figure 4.20 The active site of OMPDC with bound ligand OMP in cyan

Figure 4.21 The active site of KGPDC with bound ligand ribulose-5-phosphate in green

Figure 4.22 The active site of HPS

Figure 4.23 The substrates and products from the enzymes PRAI and HisA

Figure 4.24 Overlay of active site residues for PRAI and HisA

Figure 4.25 The reaction mechanisms for OMPDC, KGPDC, HPS and RPE

Figure 4.26 Aligned active sites of OMPDC, KGPDC and HPS

Figure 4.27 Aligned active sites of OMPDC and KGPDC

Figure 4.28 Aligned active sites of KGPDC and HPS

Figure 4.29 Aligned active sites of KGPDC and RPE

Figure 4.30 The four correctly annotated SG proteins in the IGPS subclass

Figure 4.31 The five SG proteins that matched the TrpA subclass

Figure 4.32 The five SG proteins that matched the HisA/HisF subclass

Figure 4.33 The four SG proteins that matched the RPE subclass

Figure 4.34 The seven SG proteins that matched the OMPDC subclass

Figure 4.35 The one SG protein that matched the KGPDC subclass

List of Tables

Chapter 1: Introduction to molecular modeling in drug discovery

Table 1.1 List of programs that use various methods and algorithms for sampling and scoring functions

Chapter 2: Designing small molecules to target the human A2A receptor for hypoxia immunotherapies

Table 2.1 Tissue distribution, modulated systems and disease targets for each subtype of adenosine receptors

Table 2.2 Functional assay data from agonists and antagonists

Table 2.3 Overview of interacting residues of ZM-241385

Table 2.4 POOL predictions from PDB 3eml and 3qak

Table 2.5 Simulation cell parameters for YASARA AutoDock

Table 2.6 Resulting Z-scores of homology model

Chapter 3: Introduction of protein function annotation method: Structurally Aligned Local Sites of Activity (SALSA)

Table 3.1 Overview of the three phases of the Protein Structure Initiative

Table 3.2 Names of the POOL output files using the POOL program

Chapter 4: Applying SALSA and SALSA-DT to the Ribulose Phosphate Binding Barrel (RPBB) superfamily

Table 4.1 Overview of enzymes found in the RPBB superfamily

Table 4.2 List of protein structures available in the PDB with structural information

Table 4.3 The reported catalytic residues from the CSA for the subclasses PRAI, IGPS and TrpA

Table 4.4 The reported catalytic residues from the CSA for the subclass HisA/HisF

Table 4.5 The reported catalytic residues from the CSA for the subclasses OMPDC, RPE, KGPDC and HPS

Table 4.6 Sequence identity matrix using the multiple sequence alignment program Clustal Omega

Table 4.7 The list of structural genomics proteins gathered from the PDB that have a sequence, keyword match or structural similarity to proteins found in the RPBB superfamily

Table 4.8 Reported methods used for the POOL calculations of each protein

Table 4.9 The PDB listing of structures that required building a homology model

Table 4.10 Spatially aligned consensus signatures for each subclass

Table 4.11 Comparison of consensus signatures between the OMPDC, KGPDC, HPS and RPE subclasses

Table 4.12 The consensus signatures for each subclass compared to CSA reported catalytic residues

Table 4.13 The MCS for the RPBB SALSA table

Table 4.14 Structurally aligned consensus signatures (in blue) of structural genomics proteins to subclass representatives

Table 4.15 The MCS and sequence identity scores for the SG proteins

Table 4.16 The raw classification of proteins from subclasses with grouped POOL alignments

Table 4.17 The organized and detailed analysis from SALSA-DT for IGPS, TrpA, PRAI and HisA/HisF subclasses

Table 4.18 The organized and detailed analysis from SALSA-DT for the RPE, OMPDC, KGPDC and HPS subclasses

Table 4.19 Results for clustering the SG proteins

List of Abbreviations

Abbreviation    Meaning
2D    Two-dimensional
3D    Three-dimensional
°C    degrees Celsius
Å    Angstroms
A3H6P    D-arabino-3-hexulose 6-phosphate
A2AAR    subtype A2A adenosine receptor
A2AR-/-    A2A receptor gene-deficient
AC    adenylate cyclase
Adora2a    A2A adenosine receptor gene
AICAR    5-aminoimidazole-4-carboxamide ribonucleotide
AUC    Area Under the Curve
ATP    adenosine triphosphate
B-    general base
BBB    blood-brain barrier
BCT    bicyclic triazolotriazine
BDMS    Bromodimethylsulfonium bromide
BLOSUM    BLOcks SUbstitution Matrix
Ca2+    calcium ion
cAMP    cyclic adenosine monophosphate
CE    Combinatorial Extension
CE-MC    Combinatorial Extension and Monte Carlo
CGS21680    selective A2A adenosine receptor agonist
CNS    central nervous system
ConA    Concanavalin-A
CSC    8-(m-chlorophenylbutadienyl) caffeine
CSGID    Center for Structural Genomics of Infectious Disease
CSM    Chemical Similarity Matrix
CS    consensus signatures
CSA    Catalytic Site Atlas
KW-6002    Istradefylline
DCM    Dichloromethane
DMAP    4-Dimethylaminopyridine
DMF    Dimethylformamide
DMSO    Dimethylsulfoxide
E. coli    Escherichia coli
EDCI    1-ethyl-3-(3-dimethylaminopropyl) carbodiimide
EFEB    estimated final energy of binding
ELISA    enzyme-linked immunosorbent assay to measure cytokine levels
ET    Evolutionary Trace
FDA    Food and Drug Administration
FDE    final docking energy
G3P    glyceraldehyde-3-phosphate
GDP    guanosine diphosphate
GLAT    General Local Alignment Table
G-protein    guanine nucleotide-binding protein
GPCR    G-protein coupled receptor
GSM    Generating Scoring Matrices
GTP    guanosine triphosphate
H-A    general acid
H-H    Henderson-Hasselbalch
hERG    Human ether-a-go-go related gene
HisA    phosphoribosylformimino-5-aminoimidazole carboxamide ribotide isomerase
HisF    imidazoleglycerolphosphate synthase
HPLC    high performance liquid chromatography
HPS    hexulose phosphate synthase
JCSG    Joint Center for Structural Genomics
I3P    indole-3-phosphate
ICL    intracellular loops
IFNγ    cytokine interferon-gamma
IGPS    indole-3-glycerol phosphate synthase
IL-4    cytokine interleukin-4
ImGP    imidazole glycerol phosphate
ImGPS    imidazole glycerol phosphate synthase
INTREPID    Information-theoretic Tree Traversal for Protein Functional Site Identification
LPC/CSU    Ligand-Protein Contacts & Contacts of Structural Units
3KG6P    3-keto-L-gulonate 6-phosphate
K+    potassium ion
kcat    First order rate constant
kDa    kilodalton
KO    knockout
KGPDC    keto-3-gulonate-6-phosphate decarboxylase
MACiE    Mechanism, Annotation and Classification in Enzymes
MC    Monte Carlo
MCSG    Midwest Center for Structural Genomics
MCS    normalized Match to Consensus Signatures
MD
Mg2+    magnesium ion
MSA    Multiple Structure Alignment
MW    molecular weight
NAP    number of aligned positions
NDA    New Drug Application
NECA    5’-ethyluronamide
NIGMS    National Institute of General Medical Sciences
nM    nanoMolar
nm    nanometer
NMR    Nuclear Magnetic Resonance
NSGC    Northeast Structural Genomics Consortium
OMP    orotidine-5’-monophosphate
OMPDC    orotidine-5’-monophosphate decarboxylase
PAM    point accepted mutation
PD    Parkinson’s Disease
PDB    protein data bank
PEG    poly(ethylene) glycol
PEG-Me    PEG methyl ethers
PGA    Polyglutamic acid
PLC    phospholipase
PME    Particle-mesh Ewald
POOL    Partial Order Optimum Likelihood
PMP    Protein Model Portal
PRA    N-(5’-phosphoribosyl) anthranilate
PRAI    phosphoribosyl anthranilate isomerase
PROFAR    N’[(5’-phosphoribosyl)-formimino]-5-aminoimidazol-4-carboxamid ribonucleotide
PRFAR    N-(5’-phospho-D-1’-ribulosylformimino)-5-amino-1-(5’-phosphoribosyl)-4-imidazole carboxamide
PSI    Protein Structure Initiative
PSI-BLAST    Position-Specific Iterative – Basic Local Alignment Search Tool
PSSM    Position-Specific Scoring Matrix
Qhull    QuickHull
QP    query protein
R5P    ribulose-5-phosphate
RMSD    root-mean-square-deviation
RPBB    Ribulose-phosphate binding barrel
RPE    ribulose phosphate epimerase
RuMP    ribulose monophosphate pathway
RS    residue score for each aligned position
SALSA    Structurally Aligned Local Sites of Activity
SALSA-DT    Structurally Aligned Local Sites of Activity – Delaunay Triangulation
SCOP    Structural Classification of Proteins
SFLD    Structure Function Linkage Database
SG    structural genomics
SGC    Structural Genomics Consortium
SS    similarity score
T-COFFEE    Tree-based Consistency Objective Function for alignment Evaluation
TCR    T-cell receptor
TDLN    Tumor-draining lymph nodes
TEA    Triethylamine
THEMATICS    THEoretical Microscopic Anomalous TItration Curve Shapes
THF    Tetrahydrofuran
T. maritima    Thermotoga maritima
TLC    Thin layer chromatography
TM    transmembrane
TrpA    Tryptophan Synthase alpha chain/alpha subunit
TNF-α    cytokine tumor necrosis factor alpha
UK-432097    2-(3-[1-(pyridin-2-yl)piperidin-4-yl]ureido)ethyl-6-N-(2,2-diphenylethyl)-5′-N-ethylcarboxamidoadenosine-2-carboxamide
μg    microgram
μL    microliter
μm    micrometer
μM    micromolar

UMP uridine monophosphate VAST Vector Alignment Search Tool vdW van der Waals VMD Visual-Molecular Dynamics VEH vehicle X5P xylulose-5-phosphate YASARA Yet Another Scientific Artifical Reality Application ZM-241385 4-(2-[7-amino-2-(2-furyl)-[1,2,4]triazolo-[2,3-a][1,3,5]triazin-5 ylamino]ethyl)-phenol

Chapter 1

Introduction to molecular modeling in ligand design


1.1 Overview of current molecular modeling tools in drug discovery

Traditionally, the drug discovery process consisted of medicinal chemists synthesizing molecules and performing high-throughput screening assays to identify the biologically active compounds.

In the post-genomic era, drug discovery is driven by genetic studies, animal models, molecular biology, protein science, information technology, fast computers and computational techniques.1

The advantage of these methods is that they enable delivery of new drug candidates efficiently, in terms of cost, materials and time. As shown in Figure 1.1, the time from the preclinical stage to New Drug Application (NDA) approval in the United States ranges from 3.2 years to 20 years, with an average of 8.5 years.2 The average cost to develop a drug in the 2000s, including the cost of failures, was between $1.5 billion and more than $1.8 billion.3,4

Figure 1.1 Overview of timeline to obtain New Drug Application (NDA) approval.

Computational techniques use theoretical methods to mimic or model the behavior of molecules.

Common techniques used in drug discovery include visualization, molecular mechanics, molecular dynamics simulation, comparative modeling and docking.5 These tools allow synthetic chemists to prioritize which compounds to synthesize and thus accelerate the drug discovery process in the preclinical stage. The steps of rational drug design during the preclinical stage (Figure 1.2) are disease target identification and validation, chemical hit identification, and lead optimization for the development of a drug candidate.6,7 A disease is first linked to a malfunctioning gene or protein, and this link is tested and confirmed through validation studies that establish the target's involvement in the disease. A molecule is then identified as active and further modified to find a drug candidate. Computational techniques can aid rational drug design in target identification, chemical hit identification and lead compound optimization.

Figure 1.2 Preclinical steps for drug design. The path from identification of a disease target to the development of a drug candidate takes many resources. During the chemical hits step, molecular modeling techniques can be used for in silico screening.

Structural information from methods such as x-ray crystallography, nuclear magnetic resonance (NMR) or homology modeling is the starting point for studying disease targets. It is estimated that 50% of early drug discovery projects start without a structural model; in such cases a homology model can be built if structural data for homologues are available.8,9 Programs known as active site predictors use structural and sequence information to predict binding sites, which aids molecular docking. Molecular docking is an important tool for screening molecules and predicting their binding affinities; the best-scoring molecules can be further optimized into a lead compound.10,11 Molecular modeling visualization allows drug developers to “see” the important interactions between a bound ligand and the protein target. Current trends in technology, genomics and open databases will allow for even more efficient development of new drugs.

1.2 Computationally guided ligand design

Computationally guided ligand design accelerates the drug discovery process by incorporating not only molecular modeling techniques but also empirical information in the development of a drug candidate. Knowledge of the 3D protein structure, the active site, natural substrate binding, mutagenesis data, favorable drug-like properties and groups that can be added as modifications all contributes to the design of an ideal lead compound to be synthesized.12

Modifications of molecules can be screened in silico with molecular docking programs. Understanding the chemistry of the amino acids surrounding a binding pocket, and the interactions necessary for proper binding of the ligand to the protein, provides a map for designing and testing compounds. A score or ranking system is then used to predict the best compounds.6 Crystal structures of known natural substrates, antagonists and agonists provide a starting point for designing molecules and for docking studies. The positions of key features of known ligands yield information about possible hydrogen-bonding or hydrophobic contacts, which opens up interactions to explore when designing new leads.13 Combining these methods speeds up the preclinical process.

1.3 Introduction to molecular modeling tools

In this section, the molecular modeling tools used for computationally guided ligand design projects will be discussed.

1.3.1 Comparative modeling/Homology modeling

The number of non-redundant amino acid sequence entries in UniProtKB/Swiss-Prot14 is 542,258. The number of available 3D protein structures in the Protein Data Bank15 is 97,980 (as of February 2014), of which 86,690 are x-ray crystal structures and 10,246 are solution NMR structures.

The gap between annotated sequences and available structures is large. The bottleneck in solving protein structures is primarily due to purification and crystallization methods. As an alternative, comparative modeling, also known as homology modeling, is used to predict protein structure from sequence for proteins with limited or unknown structural data. The goal of homology modeling is to predict a structure from its sequence with an accuracy comparable to that achieved by experimental methods.8 The SWISS-MODEL Protein Model Portal (PMP)16 and ModBase17 are two repositories that contain publicly available models. Homology modeling relies on the idea that proteins with similar sequences have similar structures. This may not be true in all cases, but if the sequence is highly similar across all regions of the protein (coverage), the risk of incorrect structure prediction is reduced. High-resolution template structures yield better models because they help account for loop flexibility and the correct spatial positions of amino acid side-chains. The methodology of building comparative protein structures (Figure 1.3) consists of six general steps: 1) A target sequence, the protein with unknown 3D structure, is identified. 2) Known 3D structures of one or more homologous proteins are found to serve as templates. 3) The target and template sequences are aligned to assess coverage and sequence identity. 4) The model is built to establish the overall tertiary structure and to position the side-chains and loops. 5) The model undergoes refinement. 6) The model is evaluated using assessment tools in the SWISS-MODEL workspace18, for example QMEAN19, ANOLEA20, ProQres21 and PROCHECK22.

The quality of a homology model depends on the level of sequence identity between the known 3D structure (template) and the target.23 It has been shown that models can be generated for a target that shares a sequence identity of 30% or greater with a template sequence; the two sequences are then assumed to have a common ancestor.24 The underlying idea is that, over evolutionary time, structures are more stable and change more slowly than sequences accumulate mutations; thus distantly related sequences will still fold into similar structures.8 Low sequence identity, below 15%, will not yield a good model with current alignment methods. As a general rule, a sequence identity of at least 30% allows a quality model to be produced and used for computational drug design and assignment of protein function (Figure 1.4).24 The next step is model building, where the target protein is constructed by substitution, insertion and deletion of amino acids based on the sequence alignment. Side-chains and loops are then modified based on spatial position.

Figure 1.3 Overview of homology modeling methodology for applications in drug discovery.

Figure 1.4 Safe homology modeling zone for sequence identity. If the target sequence similarity falls in the safe homology zone (above the blue line), then the model may be produced with confidence. For example, if the percentage of identical residues is 35%, there need to be over 100 identically aligned residues.
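The percentage of identical residues referred to in Figure 1.4 can be computed directly from a pairwise alignment. The sketch below is only an illustration: the function name, the gap character and the choice to normalize by the number of aligned (non-gap) positions are assumptions, since several definitions of sequence identity are in use.

```python
def percent_identity(aligned_target: str, aligned_template: str, gap: str = "-"):
    """Return (identical_count, percent_identity) for two aligned sequences.

    Both strings must come from the same alignment and have equal length.
    Columns where either sequence has a gap are skipped (an assumption).
    """
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have the same length")
    aligned = 0
    identical = 0
    for a, b in zip(aligned_target, aligned_template):
        if a == gap or b == gap:
            continue  # gapped column: not counted as an aligned position
        aligned += 1
        if a == b:
            identical += 1
    if aligned == 0:
        return 0, 0.0
    return identical, 100.0 * identical / aligned
```

Returning the raw count alongside the percentage makes it possible to check both quantities that the safe-zone plot relates to each other.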

The best model is carried forward for refinement of the protein backbone and side-chain rotamers to avoid steric clashes. The model assessment process checks the stereochemical quality and accuracy of the model against high-resolution experimental structures. Stereochemistry and conformation of the side-chains and backbone are corrected through energy minimization with force-field programs. To measure the accuracy of a structure prediction method, models are built for sequences of known structure and each model is compared to the experimental structure by measuring the distances between corresponding backbone Cα atoms. A root-mean-square deviation (RMSD) of ~1-2 Å is suitable for use in molecular docking.8
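The Cα RMSD comparison described above can be sketched as follows. This is a minimal illustration, not the procedure used by any particular program: the function name is hypothetical, and the two coordinate lists are assumed to be already superimposed and listed in the same residue order.

```python
import math

def ca_rmsd(model_ca, reference_ca):
    """RMSD in Angstrom between matched C-alpha coordinates.

    Each argument is a list of (x, y, z) tuples. No superposition is
    performed here; the structures are assumed to be pre-aligned.
    """
    if not model_ca or len(model_ca) != len(reference_ca):
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (x1, y1, z1), (x2, y2, z2) in zip(model_ca, reference_ca):
        total += (x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
    return math.sqrt(total / len(model_ca))
```

In practice an optimal superposition step would precede this calculation, so the value computed here is an upper bound on the best-fit RMSD.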

1.3.2 Homology modeling methodology in YASARA

YASARA8 was the platform used to build homology models, and the description below is adapted from its protocol.25 In YASARA, the target is the protein of known sequence whose structure is to be modeled. The template is a protein of known structure with a sequence similar to the target, which serves as the basis for building. The model is the end result generated by YASARA; it is created from the template(s) by changing loops and side-chains. The target sequence is supplied to YASARA in FASTA format. A continuous sequence indicates a monomer, whereas a vertical bar “|” within the sequence indicates a separation between molecules. This is important if the molecule is an oligomer. If the target sequence contains an unusual amino acid, YASARA will replace it with the template amino acid at that position.

The macros (.mcr) available in YASARA are hm_build.mcr and hm_buildfast.mcr. The hm_build.mcr file can be edited to set specific parameters. The hm_build macro begins by running the Basic Local Alignment Search Tool (BLAST)26 to find related sequences in UniRef90 (SwissProt and TrEMBL)14 and construct a position-specific scoring matrix (PSSM).27,28 The PSSM is used to search the PDB15 for potential modeling templates. The number of BLAST iterations and the BLAST expect value (E-value) used for template selection can be modified. Templates are selected and ranked based on the alignment score and the structural quality (resolution) according to WHAT_CHECK.29 Ideally, the templates used to build models should be high-resolution x-ray crystal structures; these give more accurate models than lower-resolution crystal structures and NMR structures. There is an option to provide your own templates for building the model, which is useful when coverage of certain regions of the protein sequence is needed. The number of PDB structures used to build the model can be modified in the macro. If the target is large, the MD simulation refinement during model building may take a long time. In this situation the macro hm_buildfast.mcr can be used to reduce the number of models built. This macro limits the number of templates and alignments searched and also reduces the sampling of loops and side-chains. In the macro, the LoopsSamples parameter can be adjusted to find more conformations of the loops and optimize the side-chains.

The side-chain rotamers are selected using the SCWALL method (Side-Chain conformation With ALL available methods). SCWALL starts from initial side-chain conformations chosen using empirical rotamer probabilities that depend on the backbone Phi and Psi dihedral angles, and then includes van der Waals (vdW) and electrostatic interactions. The optimization incorporates the YASARA2 force field and solvation energies to perform a steepest descent minimization on the rotamers.30

The program runs with animation and generates a detailed modeling report in HTML format. The report displays the model qualities with images of the models, per-residue quality plots and YASARA’s model quality score. YASARA uses a structure validation tool, CHECK.29 CHECK evaluates 1) the normality of bond lengths, bond angles, dihedrals, planarity and non-bonded (vdW and Coulomb) energies according to the force field used; and 2) packing1D, the 1D distance-dependent packing interactions in the YASARA2 force field, and packing3D, the 3D direction-dependent packing interactions in the YASARA2 force field. The model quality ranges from ‘optimal’ to ‘good’ to ‘terrible.’ The per-residue quality Z-score is a numerical value.

The per-residue quality is stored as a B-factor, which is the average of the dihedral, packing1D and packing3D Z-scores multiplied by -20. A B-factor of 0 corresponds to a Z-score of 0 and indicates perfect quality. A B-factor of 100 corresponds to a Z-score of -5 and indicates bad quality. In the per-residue plots, B-factors between 0 and 75 are colored from blue to yellow, where blue is perfect and yellow is bad. The YASARA website gives the following equation for the Z-score, which expresses how many standard deviations the model is from the average:

$$ Z = \frac{x - \mu}{\sigma} $$

Here x is the raw value of the model, μ is the average value of the gold standard population and σ is the standard deviation of the gold standard population.
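The relation between a raw quality value, its Z-score and the B-factor shown in the per-residue plots can be sketched in a few lines. The function names are hypothetical, and the factor of -20 is taken from the description above rather than from the YASARA source.

```python
def z_score(x: float, mu: float, sigma: float) -> float:
    """Number of standard deviations that x lies from the reference mean."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return (x - mu) / sigma

def quality_b_factor(average_z: float) -> float:
    """Map an averaged Z-score to the per-residue B-factor (assumed: -20 * Z)."""
    return -20.0 * average_z
```

With this convention a more negative Z-score, meaning a value further below the gold standard average, gives a larger B-factor and therefore a worse color in the plot.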

1.3.3 Active site prediction

Knowledge of the active site provides valuable information for the rational design of a small molecule that interacts with an identified target. Studying the interactions in protein-ligand complexes provides important information about the biochemical function of an enzyme. The residues important for these interactions are involved either in ligand recognition and binding or in catalysis; these interacting residues may be predicted using 3D structure information (Figure 1.5).

Figure 1.5 Example of functional active site programs. The input for a functional site predictor, whether an algorithm or a program, is a 3D structure. The program predicts the active site residues and generates a list of residues or a 3D image showing the location of the residues.

The functional site predictors Theoretical Microscopic Anomalous Titration Curve Shapes (THEMATICS)31 and Partial Order Optimum Likelihood (POOL)32, developed in the Ondrechen Research Group (ORG) at Northeastern University, identify the functionally important active site residues using 3D structural and evolutionary input features. Only a 3D protein structure is needed as input. THEMATICS calculates the electrical potential of the 3D protein structure using a finite-difference, linear Poisson-Boltzmann method. A hybrid method33 is used to calculate theoretical titration curves for all ionizable residues (Arg, Asp, Cys, Glu, His, Lys and Tyr, plus the N- and C-termini). These titration curves give the proton occupation of each ionizable residue as a function of pH. The shapes of these curves are evaluated with statistical analysis to quantify the deviation from Henderson-Hasselbalch (H-H) titration behavior; this identifies the residues that do not follow a typical H-H curve shape (Figure 1.6). THEMATICS has been shown to yield selective, highly localized predictions of catalytic and binding residues.34
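For reference, the ideal Henderson-Hasselbalch curve from which deviations are measured has a simple closed form. The sketch below computes only this reference curve for a single, non-interacting site; it is not the THEMATICS calculation, and the function name is an assumption made for illustration.

```python
def hh_proton_occupancy(ph: float, pka: float) -> float:
    """Fraction of sites that are protonated at a given pH.

    Ideal Henderson-Hasselbalch behavior for one isolated titratable group:
    occupancy = 1 / (1 + 10**(pH - pKa)).
    """
    return 1.0 / (1.0 + 10.0 ** (ph - pka))

def hh_curve(pka: float, ph_values):
    """Sample the ideal titration curve at the given pH values."""
    return [(ph, hh_proton_occupancy(ph, pka)) for ph in ph_values]
```

The ideal curve is a sigmoid that falls from 1 to 0 within a few pH units around the pKa; residues whose computed curves are elongated or stepped relative to this shape are the ones flagged by the analysis described above.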

Because ionizable residues make up only 76% of all known catalytic residues35, the ORG developed POOL to predict all 20 amino acid residue types. POOL is a new monotonicity-constrained, maximum likelihood machine learning method. It takes the THEMATICS metrics for a residue and assigns to each residue a value proportional to the probability that the residue is important for the biochemical function. POOL also defines environment variables for all residues. The environment variables measure the deviations from normal titration behavior of the ionizable residues in the immediate, local environment of every residue in the structure. These environment variables enable the prediction of all residue types in the active site, including the hydrophobic residues. POOL is flexible in that it can utilize any input feature upon which the probability of functional importance depends monotonically (Figure 1.7).

Figure 1.6 Example of a THEMATICS plot. For tryptophan synthase from Salmonella enterica subsp. enterica serovar Typhimurium (PDB 1KFK), the THEMATICS-predicted active site residues (in green/bold) are ASP38, ASP46 and ASP124. The remaining residues are not predicted to be active site residues.

The POOL method incorporates phylogenetic tree information from the Information-theoretic Tree Traversal for Protein Functional Site Identification (INTREPID) scores.36 INTREPID computes the importance of each position within all subtrees containing the sequence of interest; it suffices for a position to become conserved within some subtree for it to be detectable.36 Using the output of these two methods, the functional active site residues can be found with only the 3D structure as input.

Figure 1.7 POOL methodology.37 POOL incorporates electrostatic properties, pocket geometric information and phylogenetic information and is trained on a dataset of known catalytic residues (Catalytic Site Atlas38). The machine learning method then estimates the probabilities that specific residues belong to an interaction site.

1.3.4 Molecular docking

The 3D structure of a protein gives information about how the protein functions and interacts with ligands. A ligand is a molecule that binds reversibly to a protein. Ligands can be lipids, small molecules, DNA, RNA, peptides or small proteins. The “lock-and-key” idea provides a simplified model of the interaction between a ligand and protein, but does not take into account that proteins are flexible. The binding of a ligand can induce a conformational change in the protein that permits binding; this is called induced fit.39

Molecular docking programs model the interaction of the ligand-protein complex, specifically by predicting the ligand conformation and orientation within a binding site.11 Docking typically uses 3D structures, found in the PDB15, that are solved by x-ray crystallographic or NMR methods. A docking method comprises two components, sampling and scoring. Sampling is the generation of poses in the binding site, where a pose is the conformation and orientation of a ligand in the active site. The ligand has many conformational degrees of freedom that need to be sampled accurately to find the best fit in the binding site.10 Sampling methods can be further classified by how they treat protein flexibility.40 Scoring is an evaluation of the predicted binding energy between the ligand and protein using an empirical energy function. Most empirical energy functions in docking are based on the free energy of binding (ΔG) of an enzyme [E] and inhibitor [I] in the complex [EI] (Figure 1.8).11

Figure 1.8 Representation of the docking of inhibitor [I] in enzyme [E] resulting in complex [EI].

The correct complex structure under equilibrium conditions follows Eq. 1.1. The energy of the complex [EI] (the pose in docking) is used to calculate the binding affinity with Eq. 1.2 and 1.3.11 The pose with the lowest energy score is predicted as the best match. Sampling and scoring functions should be fast and accurate.

$$ [EI] \rightleftharpoons [E] + [I] $$ (Eq. 1.1)

$$ \Delta G = -RT \ln K_A $$ (Eq. 1.2)

$$ K_A = K_i^{-1} = \frac{[EI]}{[E][I]} $$ (Eq. 1.3)
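As a worked example of Eq. 1.2 and 1.3, an inhibition constant can be converted to a binding free energy. The sketch below is illustrative only: the function name, the choice of kcal/mol and the default temperature of 298.15 K are assumptions, and the inhibition constant is taken relative to a 1 M standard state.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)

def binding_free_energy(k_i_molar: float, temperature_k: float = 298.15) -> float:
    """Return Delta G in kcal/mol from an inhibition constant K_i in mol/L.

    Uses K_A = 1 / K_i (Eq. 1.3) and Delta G = -R T ln K_A (Eq. 1.2).
    """
    if k_i_molar <= 0:
        raise ValueError("K_i must be positive")
    k_a = 1.0 / k_i_molar
    return -R_KCAL * temperature_k * math.log(k_a)
```

A smaller K_i gives a larger K_A and hence a more negative ΔG, which is why the lowest-energy pose is taken as the tightest predicted binder; a nanomolar inhibitor corresponds to roughly -12 kcal/mol at room temperature.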

1.3.4.1 Docking Methods

Small molecule docking methods break down into two parts: a sampling function and scoring function (Figure 1.9). Current sampling methods focus on protein flexibility or ligand flexibility or flexibility in both the protein and ligand.10

1.3.4.2 Sampling function

Current methods for protein flexibility are grouped into four categories: soft docking, side-chain flexibility, molecular relaxation and protein ensemble docking.41 Restricting flexibility to the binding site allows a smaller region to be sampled than the whole structure. Soft docking focuses on the overlap region between the ligand and protein and softens the calculation of the interatomic van der Waals interactions. In side-chain flexibility, the backbone atoms are kept fixed and the side-chains are sampled. In molecular relaxation, the protein begins as a rigid body and the ligand is placed into the binding site. The protein backbone and side-chains within the binding site are then allowed to relax using molecular dynamics simulations to minimize the complex. This method is more time-consuming than soft docking and side-chain flexibility. Ensemble docking uses multiple protein structures to represent the possible conformational changes the protein undergoes. The protein structures are generated using molecular dynamics and Monte Carlo simulations. Common programs for these methods are listed in Table 1.1.

Figure 1.9 Classification for protein-ligand docking.40,41 The sampling function breaks down into two groups, incorporating protein flexibility or ligand conformations/sampling. Scoring functions are grouped by the type of algorithm applied.

The second type of sampling is ligand sampling, which is broken down into three major groups: shape matching, systematic search methods and stochastic algorithms.40,41 Ligand sampling methods look specifically at the ligand orientation and possible conformations around the binding site. Shape matching represents the ligand by its molecular surface and places the ligand in the binding site, seeking shape complementarity. The possible ligand binding orientations are based on the six degrees of freedom, three translational and three rotational. Systematic search methods generate all possible ligand binding conformations by exploring all degrees of freedom of the ligand, so the number of possible conformations grows exponentially. There are two types of systematic search methods: exhaustive and fragmentation search. Exhaustive search investigates all rotatable bonds of the ligand. Because of the large number of combinations, the conformations can be filtered by setting geometric constraints to reduce the computational cost. Fragmentation methods divide the ligand into fragments, in which some portions remain rigid while others are sampled. Each fragment is placed into the binding site and sampled, building up the best conformation one step at a time in the binding site. The last type of sampling is the stochastic algorithm. The ligand is put in the binding site with a random conformation, translation and rotation, and is then changed randomly. There are four types of stochastic methods used in current programs: Monte Carlo (MC), evolutionary/genetic algorithms42, Tabu search methods43 and swarm optimization (SO)44. The MC method calculates the probability of accepting a random change from the Boltzmann probability function, where E0 and E1 are the energy scores of the ligand before and after the random change:

$$ P \sim \exp\left( \frac{-(E_1 - E_0)}{k_B T} \right) $$ (Eq. 1.4)

Here kB is the Boltzmann constant and T is the temperature in Kelvin. If the score is better than the previous one, the conformation is accepted; otherwise it is accepted with probability P. The cycle then continues.
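The acceptance step of Eq. 1.4 can be sketched as below. This follows the standard Metropolis criterion, in which a move that worsens the score is still accepted with the Boltzmann probability; that rule, the function names, the kcal/mol energy units and the default temperature are assumptions made for illustration and are not taken from a specific docking program.

```python
import math
import random

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol*K)

def acceptance_probability(e0: float, e1: float, temperature_k: float = 300.0) -> float:
    """Probability of accepting a change from energy e0 to energy e1."""
    if e1 <= e0:
        return 1.0  # a better (lower) score is always accepted
    return math.exp(-(e1 - e0) / (K_B * temperature_k))

def accept_move(e0: float, e1: float, temperature_k: float = 300.0, rng=random) -> bool:
    """Draw a uniform random number and apply the acceptance probability."""
    return rng.random() < acceptance_probability(e0, e1, temperature_k)
```

Accepting some uphill moves is what lets the search escape local minima; raising the temperature makes such moves more likely.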

Evolutionary and genetic algorithms search for a ligand binding mode using a process modeled on evolution in biological systems, in which candidate poses are selected according to their fitness. Tabu search takes the current conformation and makes small random changes in the space around the ligand; the changes are ranked and rejected according to the cut-off values implemented. In swarm optimization, the ligand moves within a search space. The space is modeled using swarm intelligence and the best pose is selected using information from neighboring positions.

Table 1.1 List of programs that use various methods and algorithms in their sampling and scoring functions.11,40,41

Protein flexibility
  Soft docking: GLIDE
  Side-chain flexibility: GLIDE
  Molecular relaxation: MD, Monte Carlo
  Protein ensemble: GLIDE, DOCK
Ligand sampling
  Shape matching: DOCK, FRED, FLOG, LigandFit, MS-DOCK, Surflex
  Systematic search: DOCK, FlexX, GLIDE, Hammerhead, FLOG
  Stochastic algorithms: AutoDock, MOE-Dock, GOLD, PRO_LEADS
Scoring functions
  Force-field: AMBER, CHARMM
  Empirical: LUDI, F-Score, ChemScore, SCORE, Fresno, X-SCORE
  Knowledge-based: PMF, DrugScore, SMoG

1.3.4.3 Scoring Functions

The predicted conformations of the ligand from the sampling algorithms are assessed using a scoring function.11 Each conformation of the ligand with the protein is evaluated and ranked against the other conformations. Reliable scoring functions make assumptions and use empirical information to predict binding affinity. There are three types of scoring functions: force-field based, empirical and knowledge-based. A fourth approach, consensus scoring, combines multiple functions.

The basic functional form of a force field includes both bonded terms, relating to atoms that are linked by covalent bonds, and non-bonded terms, describing the long-range electrostatic and van der Waals forces.11 A general form for the total energy in an additive force field can be written as:

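$$ E_{\text{total}} = E_{\text{bonded}} + E_{\text{nonbonded}} $$ (Eq. 1.5)

$$ E_{\text{bonded}} = E_{\text{bond}} + E_{\text{angle}} + E_{\text{dihedral}} $$ (Eq. 1.6)

$$ E_{\text{nonbonded}} = E_{\text{electrostatic}} + E_{\text{van der Waals}} $$ (Eq. 1.7)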

The force field calculates the interaction energy of the ligand-protein complex and the internal energy of the ligand. The form used for macromolecular systems includes harmonic bond stretching and angle bending, Fourier series torsional energies, and Coulomb plus Lennard-Jones terms for intermolecular and intramolecular interactions.45 The electrostatic potential energy is represented by Equation 1.8, a summation of Coulombic interactions. Using a distance-dependent dielectric function in this formula lessens the contribution from charge-charge interactions. The distance between two atoms i and j is given by rij, NA and NB are the numbers of atoms in molecules A and B, and q is the charge on each atom.

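$$ E_{coul}(r) = \sum_{i=1}^{N_A} \sum_{j=1}^{N_B} \frac{q_i q_j}{4 \pi \varepsilon_0 r_{ij}} $$ (Eq. 1.8)

$$ E_{vdW}(r) = \sum_{i=1}^{N} \sum_{j=1}^{N} 4\varepsilon \left[ \left( \frac{\sigma_{ij}}{r_{ij}} \right)^{12} - \left( \frac{\sigma_{ij}}{r_{ij}} \right)^{6} \right] $$ (Eq. 1.9)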

The van der Waals potential energy for non-bonded interactions is represented in Equation 1.9 by the Lennard-Jones 12-6 function. The term with exponent 12 is responsible for short-range repulsion and the term with exponent 6 for the attractive forces arising from instantaneous dipoles. The 12-6 Lennard-Jones potential gives a steep repulsive wall and is therefore less forgiving of close contacts between receptor and ligand atoms. The well depth of the potential, ε, and the collision diameter, σ, are specific to the atoms i and j.11
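The two non-bonded terms of Eq. 1.8 and 1.9 can be evaluated with a short script. This is a sketch under stated assumptions rather than the implementation of any force field: the function names and the atom tuple layout are hypothetical, distances are in Å and charges in units of the elementary charge, and the prefactor 1/(4πε0) is replaced by the commonly used conversion constant of about 332.06 kcal·Å/(mol·e²), whose exact value differs slightly between packages.

```python
import math

COULOMB_K = 332.0637  # assumed value of 1/(4*pi*eps0) in kcal*Angstrom/(mol*e^2)

def coulomb_energy(atoms_a, atoms_b):
    """Sum of pairwise Coulomb terms between molecules A and B (Eq. 1.8).

    Each atom is a tuple (x, y, z, charge).
    """
    total = 0.0
    for xa, ya, za, qa in atoms_a:
        for xb, yb, zb, qb in atoms_b:
            r = math.sqrt((xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2)
            total += COULOMB_K * qa * qb / r
    return total

def lennard_jones_pair(r: float, epsilon: float, sigma: float) -> float:
    """Single 12-6 Lennard-Jones term for one atom pair (Eq. 1.9)."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)
```

The pair term is zero at r = σ and reaches its minimum of -ε at r = 2^(1/6) σ, which shows how quickly the repulsive wall rises once two atoms approach closer than the collision diameter.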

This type of scoring models enthalpic gas-phase contributions to structure and energy and leaves out solvation terms. Solvation can be accounted for by including additional force field parameters in the algorithm. The choice depends strongly on how the protein of interest is to be modeled.

The common force fields used in docking software programs are AMBER46 and CHARMM47.

Empirical scoring is based on the idea of using experimental data (training set) and the summation of energy terms: vdW, electrostatic, hydrogen bonding, desolvation, entropic and hydrophobic, to get a binding energy score.

$$ \Delta G = \sum_i W_i \times \Delta G_i $$ (Eq. 1.10)

ΔGi represents the individual empirical energy terms, and the corresponding coefficients Wi are determined by reproducing the binding affinities of a training set of protein-ligand complexes with known 3D structures, using least-squares fitting.40 Regression analysis and fitting are therefore performed on a data set of known protein-ligand complex structures.11
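The weighted sum of Eq. 1.10 and the least-squares determination of its coefficients can be illustrated with a two-term example. Everything here is hypothetical: the function names are invented for this sketch, real empirical scoring functions use more terms and larger training sets, and the two-parameter normal equations are solved in closed form only to keep the example free of external libraries.

```python
def empirical_score(weights, terms):
    """Delta G as the weighted sum of individual energy terms (Eq. 1.10)."""
    return sum(w * t for w, t in zip(weights, terms))

def fit_two_weights(term_rows, observed):
    """Least-squares fit of (W1, W2) so that W1*t1 + W2*t2 matches observed.

    term_rows is a list of (t1, t2) pairs, one per training complex, and
    observed holds the measured binding free energies in the same order.
    """
    s11 = sum(a * a for a, _ in term_rows)
    s12 = sum(a * b for a, b in term_rows)
    s22 = sum(b * b for _, b in term_rows)
    t1 = sum(a * y for (a, _), y in zip(term_rows, observed))
    t2 = sum(b * y for (_, b), y in zip(term_rows, observed))
    det = s11 * s22 - s12 * s12
    if det == 0:
        raise ValueError("energy terms are collinear; weights are not unique")
    # Cramer's rule applied to the 2x2 normal equations
    w1 = (t1 * s22 - t2 * s12) / det
    w2 = (s11 * t2 - s12 * t1) / det
    return w1, w2
```

Once fitted, the weights are applied unchanged to new poses through `empirical_score`, so the quality of the predictions depends on how well the training complexes represent the system being docked.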

Knowledge-based scoring uses potential energy parameters derived from experimentally determined protein-ligand complexes. The downside of this method is the limited number of known ligand-protein complexes. The main principle used is the potential of mean force (PMF). A training set is analyzed and the reference complex is compared to the training set.

Consensus scoring combines scoring information of multiple scoring functions to increase the probability of finding a correct conformation. The common programs and algorithms for the types of methods are broken down into groups in Table 1.1.

Each part of the molecular docking procedure has pros and cons depending on the type of system. The general idea for docking is to begin with an initial sampling method. Ideally, combining methods will yield the best result. Computational time, program accessibility and cost will determine which methods to use. Cross-docking validation between programs and algorithms may help to increase confidence in a docking method or result.

1.4 General protocol for docking

The set-up for any docking procedure is a four-step process, but each step varies with the availability of literature, experimental work and programs (available for purchase or produced in-house).

1) Target selection – A computationally guided docking protocol begins with a target structure. The protein structure is most likely experimentally determined, by x-ray crystallography or NMR, and available from the PDB15. The protein should usually be in its biologically active form and in a stable conformation. Molecular minimization and dynamics simulations can help in the preparation of the model structure of the target. Homology models can be used if a 3D structure is not known. The template should have similar coverage of the active site in order to produce a model of high quality. The structure should be pre-processed by checking charges and missing residues and by adding hydrogens and any necessary water molecules. Incorrect atom positions or charge assignments will greatly affect the docking predictions if the structure is not checked. To test the docking program, the ligand from the original crystal structure of a protein-ligand complex can be re-docked.

2) Ligand preparation – The ligands should be prepared in a similar fashion to the protein structure. With the increasing number of 3D structures of protein-ligand complexes, the ligand can often be downloaded from the PDB.15 The resolution of the structure and the geometry of the ligand should be examined for any distortions. Other databases such as ZINC48 or PubChem49 also provide 3D ligands. Ligands may also be drawn in ChemDraw50 and converted to 3D structures. It is important to minimize the ligand structure using an appropriate program. Energy minimization with an empirical force field or a semi-empirical QM method corrects the bond lengths and bond angles. The charge(s) and stereochemistry of any ligand should be checked.

3) Docking (ligand-protein complex) – A program with suitable sampling and scoring functions should be chosen. The correct file formats for the ligand and protein should be used. Most visualization programs can be used to convert between file formats. Always check the structures after they are imported into a program. The binding site residues can be defined in the program to improve the predictions for the ligand-protein complex.

4) Evaluation of docking results – Each docking program reports its results differently. The score is reported as a binding affinity or binding constant, with negative or positive values and with units that depend upon the conventions and parameters used. The absolute values of the docking scores cannot be compared between programs. When a known ligand-protein complex is available, the results can be evaluated by the root-mean-square deviation (RMSD) between the heavy atom positions of the predicted ligand pose and those observed experimentally. An RMSD value of zero means that the predicted ligand pose is exactly the same as in the crystal structure. For docking, a pose with an RMSD of 2 Å or less is considered successful.51
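
The RMSD criterion in step 4 can be illustrated with a short sketch. It assumes that both poses list the same heavy atoms in the same order and are expressed in the receptor coordinate frame, so that no superposition is applied. The coordinates and the function name heavy_atom_rmsd are hypothetical.

```python
import math

def heavy_atom_rmsd(predicted, reference):
    """RMSD (in the units of the input, here Å) between matched heavy atoms.

    predicted, reference: lists of (x, y, z) tuples in the same atom order.
    No fitting is performed; both poses share the receptor frame.
    """
    if not predicted or len(predicted) != len(reference):
        raise ValueError("poses must contain the same, non-zero number of atoms")
    squared = sum(
        (p - r) ** 2
        for atom_p, atom_r in zip(predicted, reference)
        for p, r in zip(atom_p, atom_r)
    )
    return math.sqrt(squared / len(predicted))

# Hypothetical three-atom ligand: the docked pose is shifted by 1 Å along x.
crystal = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]
docked = [(x + 1.0, y, z) for x, y, z in crystal]

rmsd = heavy_atom_rmsd(docked, crystal)
is_successful = rmsd <= 2.0  # the 2 Å cut-off quoted in the text
```

For symmetric ligands a symmetry-aware atom matching would be needed; that refinement is outside the scope of this sketch.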

1.4.1 Sampling and scoring functions used in the ORG

The following programs were available in the Ondrechen Research Group: YASARA Structure, PRODRG, YASARA implemented with AutoDock 4.0, and Schrodinger’s GLIDE. The methods, parameters and evaluation of docking are discussed in this section.

1.4.2 YASARA Energy Minimization

Small molecules in PDB format can be imported into YASARA Structure25. The AMBER03 force field can be selected. The macro file em_run.mcr is then selected to run an energy minimization of the atoms in question. The output from YASARA is the minimized structure and an RMSD value. Other input file formats can be used, but for further use in YASARA the .pdb file format is easiest.

1.4.3 PRODRG

PRODRG52 is a fast, automated small-molecule topology generator for 3D representation. To prepare a small molecule for docking, the ligand should have correct geometry: bond lengths, bond angles and dihedral angles. The input formats accepted by PRODRG are ASCII drawing, SMILES string, PDB and MOL (Molfile/SDfile). PRODRG processes the input and creates a connection table, and the protonation state of any atom can be assigned using INSHYD (insert hydrogen) and DELHYD (delete hydrogen). The generated topology can then be used in GROMACS (or another molecular simulation program). The output files may be written in various file formats to be incorporated in molecular docking programs. PRODRG performs as well as or better than similar topology programs. The advantage of the PRODRG server is its range of input file, force field and output file options, which speeds up the virtual screening process.

1.4.4 YASARA implemented with AutoDock

The docking was performed using AutoDock53, and the description below is adapted from its protocol. The molecular modeling program YASARA (Yet Another Scientific Artificial Reality Application) uses version 4 of AutoDock54. To perform docking, YASARA incorporates AutoDock, which provides parallel docking capability and docking parameters. The point charges are assigned according to the AMBER03 force field55. AutoDock 4.0 is based on the Lamarckian genetic algorithm (LGA), a hybrid of a genetic algorithm and an adaptive local search method.51,53

The method mimics the process of natural selection under the Lamarckian model of genetics, in which the environmental adaptations of an individual’s phenotype are reverse transcribed into its genotype and become heritable traits.53 The arrangement of the ligand and protein is defined by a set of values describing the translation, orientation and conformation of the ligand with respect to the protein. Each of these state variables can be seen as a gene. The state of the ligand is like a genotype, and the atomic coordinates of the ligand correspond to a phenotype. The fitness in docking is the total interaction energy of the ligand-protein complex. In genetics, crossover occurs when random pairs of individuals are mated and the new individuals inherit genes from either parent. Some of these new individuals can undergo random mutation. Individuals with better fitness are selected to move on. For docking, the genetic algorithm has five stages: mapping, fitness evaluation, crossover, mutation and selection. Mapping converts the ligand state into atomic coordinates for a large number of possible conformations. The possible conformations are tested for fitness: the total interaction energy is evaluated and conformations with good values move on for selection. Crossover and mutation break the description of the ligand into pieces, and the pieces are recombined randomly based on the translation, torsion and orientation values. The fitness is then calculated for the new set of conformations of the ligand-protein complex. The genetic algorithm iterates over generations to find the best docked conformation.

AutoDock uses a grid-based method for energy evaluation and an efficient search of torsional freedom. The final docked energy (FDE) and the estimated free energy of binding (EFEB) include van der Waals and electrostatic interactions. The number of docking runs can start out small for a test docking and then be doubled to test whether a better conformation of the ligand-protein complex is predicted. If not, the number of runs does not have to be increased. The best hit of 50 runs was shown to give the best binding energy.

When the docking results from YASARA are reported as a free energy of binding (in kcal/mol), the best results are those with negative free energies of binding. In the YASARA output, the BindEnergy command calculates the binding energy of the selected object with respect to the ‘soup’ according to the current force field. The binding energy is obtained by calculating the energy of the unbound state at infinite distance and subtracting the energy of the bound state. Therefore, the more positive the binding energy, the more favorable the interaction with the chosen force field in YASARA. In other programs, the reported binding energies are energies of binding and will be negative. YASARA reports energies in kJ/mol or kcal/mol, depending on the EnergyUnit set in the macro. The resulting binding energy is derived from force field potential energies and cannot be quantitatively compared to experimentally measured free energies of binding. The force field settings in the YASARA docking macro can be modified if necessary.

To analyze the protein structure, a simulation cell is set up around certain atoms or a point in the structure. The cell is defined by its extent along the x, y and z directions and by the angles alpha, beta and gamma. Docking is performed with the simulation cell placed around the selected atoms; the simulation cell must be large enough for the ligand and the protein binding pocket to fit. Docking will only yield solutions where the ligand can find space. This option will increase the accuracy of the docking experiment. The receptor can be treated as rigid or flexible.

The settings for receptor flexibility are selected together with the simulation cell position. Atoms inside the cell are treated as flexible, while the others remain rigid. This option will slow down docking, but the active site side-chains will remain free to move. YASARA contains various macro (.mcr) files to dock multiple ligands (dock_runscreening.mcr) or protein ensembles (dock_runensemble.mcr), and dock_run.mcr can be adapted to modify any docking parameters. The original AutoDock parameters are available from the http://autodock.scripps.edu website and can be adjusted within the macros in YASARA.
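
The genetic algorithm stages described in this section (mapping, fitness, crossover, mutation and selection, with a Lamarckian local search) can be sketched on a toy problem. This is not the AutoDock implementation: the energy function is an invented quadratic stand-in for the interaction energy, the state vector holds only four hypothetical state variables, and the local search is a simple random hill climb rather than the adaptive method used by AutoDock. All names and parameter values are assumptions.

```python
import random

# Hypothetical optimum of the toy energy surface (stands in for the best pose).
TARGET = [1.0, -2.0, 0.5, 3.0]

def energy(state):
    """Toy fitness: lower is better, with a minimum of 0 at TARGET."""
    return sum((s - t) ** 2 for s, t in zip(state, TARGET))

def local_search(state, rng, steps=20, step_size=0.2):
    """Random hill climb; only improving moves are accepted."""
    best, best_energy = state[:], energy(state)
    for _ in range(steps):
        trial = [s + rng.gauss(0.0, step_size) for s in best]
        trial_energy = energy(trial)
        if trial_energy < best_energy:
            best, best_energy = trial, trial_energy
    return best

def lamarckian_ga(rng, pop_size=30, generations=40, mutation_rate=0.1):
    """Toy Lamarckian genetic algorithm over a list of state variables."""
    population = [[rng.uniform(-5.0, 5.0) for _ in TARGET]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Lamarckian step: the locally optimized state replaces the genotype.
        population = [local_search(ind, rng) for ind in population]
        # Selection: the fitter half survives unchanged.
        population.sort(key=energy)
        survivors = population[: pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            parent_a, parent_b = rng.sample(survivors, 2)
            # Crossover: each gene is inherited from either parent.
            child = [rng.choice(pair) for pair in zip(parent_a, parent_b)]
            # Mutation: occasionally perturb a gene at random.
            child = [g + rng.gauss(0.0, 1.0) if rng.random() < mutation_rate else g
                     for g in child]
            children.append(child)
        population = survivors + children
    return min(population, key=energy)

best = lamarckian_ga(random.Random(0))
```

Writing the locally optimized state back into the individual is what distinguishes the Lamarckian variant from a conventional genetic algorithm, in which only the unoptimized genotype would be inherited.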

1.4.5 Schrodinger’s GLIDE

GLIDE (Grid-based Ligand Docking with Energetics)56,57 is a program that searches for favorable interactions between a ligand and protein. GLIDE uses filters to eliminate conformations that are not fit for the active site of the receptor. The sequence of filters represents the shape and properties of the receptor on grids, generates ligand conformers that may fit into the grids, and then searches for the best ligand conformers and energy minimizes them in the receptor using a Monte Carlo procedure (Figure 1.10).

Figure 1.10 Filtering system of GLIDE scoring.

GLIDE generates grids for the receptor in the region of the ligand binding pocket after the structure has been prepared with the Protein Preparation Wizard56. The Protein Preparation Wizard initially adds hydrogen atoms and checks bond orders and formal charges. Using ProtAssign, the hydrogen bonding network is optimized by sampling its degrees of freedom: the terminal rotatable side-chain groups of Asn, Gln and His are flipped by 180° about the terminal chi angle, which changes the spatial hydrogen bonding capabilities of the side-chains. Hydrogen atoms are added to the hydroxyl and thiol groups. The standard mode of ProtAssign performs full sampling of all states for hydrogen-bond clusters, which can be 100 combinations. The Protein Preparation Wizard assigns tautomer/ionization states for the protein and identifies missing residues. The rotatable hydrogen atoms of the Cys, Ser, Thr and Tyr side-chains are optimized. Protein minimization with Impref relaxes the structure with the OPLS_2005 force field58, which results in convergence of the heavy atoms to an RMSD of 0.3 Å.

Conformers are generated for each ligand. All-atom ligands are prepared using LigPrep. LigPrep adds hydrogen atoms, filters unsuitable molecules based on their properties, removes any unwanted molecules (waters or ions) from the 3D coordinates, neutralizes charged groups, generates ionization and tautomeric states (using Epik), generates stereoisomers, generates low-energy ring conformations and optimizes geometries.

GLIDE works by using filters to conduct a conformational search for possible ligand poses in the active site. The ligand is divided into regions: a core and rotamer groups. The core region is represented by a set of core conformations, and each rotamer group has its own conformations. The same is repeated for other possible core regions in the ligand. Carbon and nitrogen end groups terminated with hydrogen (-CH3, -NH2, -NH3) are not considered rotatable. Each possible core plus rotamer group is docked through an exhaustive search. The search begins with a site-point: GLIDE places the ligand at that point and finds a good match by investigating possible orientations. The second stage of the hierarchical filter is the diameter test. For the atoms that fall within a specified distance, possible orientations are investigated, again avoiding steric clashes. A subset test is applied if atoms in the ligand can make hydrogen bonds or ligand-metal interactions with the receptor. The orientations are all scored. The next filter, the ‘greedy core’, is based on the ChemScore empirical scoring function. This algorithm recognizes hydrophobic, hydrogen-bonding and metal-ligand interactions and penalizes steric clashes. The good orientations are then sent for refinement to generate a best pose. GLIDE places the conformers into the receptor grid and then applies a force field. Energy minimization is performed using the OPLS van der Waals and electrostatic grids for the protein. The minimized poses of the protein-ligand complex are rescored using GlideScore. GlideScore is similar to ChemScore, but it also includes a steric clash term, penalties and rewards for buried electrostatic mismatches, amide twist penalties and volume penalties.

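$$\mathrm{GlideScore} = 0.05\,(\mathrm{vdW}) + 0.15\,(\mathrm{Coul}) + \mathrm{Lipo} + \mathrm{Hbond} + \mathrm{Metal} + \mathrm{Rewards} + \mathrm{RotB} + \mathrm{Site} \qquad \text{(Eq. 1.11)}$$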

The GlideScore is intended to be comparable across different ligands as an estimate of binding affinity. The best ligand poses are sampled and evaluated using a Monte Carlo procedure, which examines nearby torsional minima by altering internal torsion angles. GlideScore is used to predict binding affinity and to rank-order ligands. For GlideScore, a more negative value is better.
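
Eq. 1.11 is a weighted sum, which the following sketch evaluates for a single pose. Only the two coefficients are taken from the equation; the term values, the dictionary layout and the function name glide_score are hypothetical and are not produced by GLIDE.

```python
# Weights from Eq. 1.11; terms without an explicit coefficient have weight 1.
GLIDESCORE_WEIGHTS = {"vdW": 0.05, "Coul": 0.15}
GLIDESCORE_TERMS = ("vdW", "Coul", "Lipo", "Hbond", "Metal",
                    "Rewards", "RotB", "Site")

def glide_score(terms):
    """Evaluate the weighted sum of Eq. 1.11 for one pose.

    terms: dict mapping term names to values; missing terms count as zero.
    """
    return sum(GLIDESCORE_WEIGHTS.get(name, 1.0) * terms.get(name, 0.0)
               for name in GLIDESCORE_TERMS)

# Hypothetical term values for one docked pose (kcal/mol).
pose_terms = {
    "vdW": -40.0, "Coul": -10.0, "Lipo": -2.5, "Hbond": -1.0,
    "Metal": 0.0, "Rewards": -0.5, "RotB": 0.6, "Site": -0.1,
}
score = glide_score(pose_terms)
```

With these invented values the van der Waals and Coulomb terms contribute -2.0 and -1.5, and the total is -7.0.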

1.5 Evaluation of docking results

There are a few ways to analyze and evaluate the docking results. If a crystal structure with a co-crystallized ligand is available, the ligand can be re-docked into the crystal structure. The resulting pose should be aligned with the co-crystallized ligand. To check the performance, the root-mean-square deviation (RMSD) between the ligand structures is calculated atom by atom. An RMSD value of less than 2 Å indicates good performance.10,41,51 The pose with the lowest RMSD may not be the best-scoring binding pose, so both need to be considered when examining the results. Investigating the interactions and contacts between the ligand and protein is the next step. The pose of the ligand in the active site will explain the likely hydrogen bonding, electrostatic or hydrophobic interactions that enable binding.

A ligand that is predicted to have good binding is not necessarily an effective drug. To find the therapeutic effects and toxicity of a new drug, experimental in vivo and in vitro functional assays are performed. Prior to docking, an understanding of the disease target and the development of a functional assay enable testing of the predicted ligands. The functional assays assess binding and monitor a response that activates or inhibits the target. Comparing the theoretical binding energy or inhibition constant (Ki) between docking programs is not possible due to their varied methods of sampling and scoring. The same is the case for experimental functional assays; efficacy does not necessarily correlate with the molecular docking results.
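
One way to quantify how well docking results agree with assay data is a rank correlation. The sketch below computes the Spearman coefficient for a hypothetical set of five ligands, assuming no tied values; the docking scores, the pKi values and the function names are invented for illustration.

```python
def rank_best_first(values, lower_is_better):
    """Return the rank (1 = best) of each value, assuming no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i],
                   reverse=not lower_is_better)
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman_rho(ranks_a, ranks_b):
    """Spearman rank correlation for untied ranks: 1 - 6*sum(d^2)/(n(n^2-1))."""
    n = len(ranks_a)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * d_squared / (n * (n * n - 1))

# Hypothetical data for five ligands.
docking_scores = [-9.5, -8.7, -8.1, -7.4, -6.0]  # more negative is better
measured_pki = [7.9, 8.3, 6.5, 7.0, 5.2]         # higher is better

score_ranks = rank_best_first(docking_scores, lower_is_better=True)
assay_ranks = rank_best_first(measured_pki, lower_is_better=False)
rho = spearman_rho(score_ranks, assay_ranks)
```

For these invented values the coefficient is 0.8; a value near zero would indicate that the docking ranking carries little information about the measured potency.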

1.6 Summary

This chapter introduces the types of molecular modeling methods that are used in this dissertation. The methods include visualization, molecular mechanics, comparative modeling and molecular docking. Application of these methods in drug discovery programs shows that they lower the cost and decrease time for discovery by prioritizing compounds for testing. With improvements in computers and programming, the application of molecular modeling in drug discovery will continue to increase. An example of the application of these methods for computationally guided ligand design for the human adenosine A2A receptor will be discussed in Chapter 2.

References

1. Lounnas, V.; Ritschel, T.; Kelder, J.; McGuire, R.; Bywater, R. P.; Foloppe, N., Current progress in structure-based rational drug design marks a new mindset in drug discovery. Computational and Structural Biotechnology Journal 2013, 5 (6), 1-14.

2. Dickson, M.; Gagnon, J. P., Key factors in the rising cost of new drug discovery and development. Nat Rev Drug Discov 2004, 3, 417-429.

3. Paul, S. M.; Mytelka, D. S.; Dunwiddie, C. T.; Persinger, C. C.; Munos, B. H.; Lindborg, S. R.; Schacht, A. L., How to improve R&D productivity: the pharmaceutical industry's grand challenge. Nat Rev Drug Discov 2010, 9 (3), 203-14.

4. DiMasi, J. A.; Hansen, R. W.; Grabowski, H. G., The price of innovation: new estimates of drug development costs. J Health Econ 2003, 22 (2), 151-185.

5. Leach, A. R., Molecular modelling: principles and applications. 2nd ed.; Prentice Hall: Harlow, England; New York, 2001; p xxiv, 744 p., 16 p. of plates.

6. Anderson, A. C., The process of structure-based drug design. Chem Biol 2003, 10 (9), 787-97.

7. Seddon, G.; Lounnas, V.; McGuire, R.; van den Bergh, T.; Bywater, R. P.; Oliveira, L.; Vriend, G., Drug design for ever, from hype to hope. J Comput Aided Mol Des 2012, 26 (1), 137-50.

8. Krieger, E.; Nabuurs, S. B.; Vriend, G., Homology modeling. Methods Biochem Anal 2003, 44, 509-23.

9. Tanrikulu, Y.; Schneider, G., Pseudoreceptor models in drug design: bridging ligand- and receptor-based virtual screening. Nat Rev Drug Discov 2008, 7, 667-676.

10. Elokely, K. M.; Doerksen, R. J., Docking Challenge: Protein Sampling and Molecular Docking Performance. J Chem Inf Model 2013.

11. Kitchen, D. B.; Decornez, H.; Furr, J. R.; Bajorath, J., Docking and scoring in virtual screening for drug discovery: Methods and applications. Nat Rev Drug Discov 2004, 3 (11), 935-949.

12. (a) Lipinski, C. A.; Lombardo, F.; Dominy, B. W.; Feeney, P. J., Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Adv Drug Deliv Rev 2001, 46 (1-3), 3-26; (b) Kerns, E. H.; Di, L., Drug-like Properties: Concepts, Structure Design and Methods: from ADME to Toxicity Optimization Introduction. Drug-Like Properties: Concepts, Structure Design and Methods 2008, 3-5.

13. Jorgensen, W. L., The many roles of computation in drug discovery. Science 2004, 303 (5665), 1813-8.

14. UniProt Consortium, Ongoing and future developments at the Universal Protein Resource. Nucleic Acids Res 2011, 39 (Database issue), D214-9.

15. Berman, H. M.; Westbrook, J.; Feng, Z.; Gilliland, G.; Bhat, T. N.; Weissig, H.; Shindyalov, I. N.; Bourne, P. E., The Protein Data Bank. Nucleic Acids Res 2000, 28 (1), 235-42.

16. Haas, J.; Roth, S.; Arnold, K.; Kiefer, F.; Schmidt, T.; Bordoli, L.; Schwede, T., The Protein Model Portal--a comprehensive resource for protein structure and model information. Database (Oxford) 2013, 2013, bat031.

17. Pieper, U.; Webb, B. M.; Barkan, D. T.; Schneidman-Duhovny, D.; Schlessinger, A.; Braberg, H.; Yang, Z.; Meng, E. C.; Pettersen, E. F.; Huang, C. C.; Datta, R. S.; Sampathkumar, P.; Madhusudhan, M. S.; Sjolander, K.; Ferrin, T. E.; Burley, S. K.; Sali, A., ModBase, a database of annotated comparative protein structure models, and associated resources. Nucleic Acids Res 2011, 39 (Database issue), D465-74.

18. Bordoli, L.; Kiefer, F.; Arnold, K.; Benkert, P.; Battey, J.; Schwede, T., Protein structure homology modeling using SWISS-MODEL workspace. Nat Protoc 2009, 4 (1), 1-13.

19. Benkert, P.; Tosatto, S. C.; Schomburg, D., QMEAN: A comprehensive scoring function for model quality assessment. Proteins 2008, 71 (1), 261-77.

20. Melo, F.; Feytmans, E., Assessing protein structures with a non-local atomic interaction energy. J Mol Biol 1998, 277 (5), 1141-52.

21. Wallner, B.; Elofsson, A., Identification of correct regions in protein models using structural, alignment, and consensus information. Protein Sci 2006, 15 (4), 900-13.

22. Laskowski, R. A.; MacArthur, M. W.; Moss, D. S.; Thornton, J. M., PROCHECK: a program to check the stereochemical quality of protein structures. Journal of Applied Crystallography 1993, 26 (2), 283-291.

23. Cavasotto, C. N.; Phatak, S. S., Homology modeling in drug discovery: current trends and applications. Drug Discov Today 2009, 14 (13/14), 676-682.

24. Hillisch, A.; Pineda, L. F.; Hilgenfeld, R., Utility of homology models in the drug discovery process. Drug Discov Today 2004, 9 (15), 659-69.

25. Krieger, E.; Koraimann, G.; Vriend, G., Increasing the precision of comparative models with YASARA NOVA--a self-parameterizing force field. Proteins 2002, 47 (3), 393-402.

26. Altschul, S. F.; Gish, W.; Miller, W.; Myers, E. W.; Lipman, D. J., Basic local alignment search tool. J Mol Biol 1990, 215 (3), 403-10.

27. Ben-Gal, I.; Shani, A.; Gohr, A.; Grau, J.; Arviv, S.; Shmilovici, A.; Posch, S.; Grosse, I., Identification of transcription factor binding sites with variable-order Bayesian networks. Bioinformatics 2005, 21 (11), 2657-66.

28. Altschul, S. F.; Madden, T. L.; Schaffer, A. A.; Zhang, J.; Zhang, Z.; Miller, W.; Lipman, D. J., Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25 (17), 3389-402.

29. Hooft, R. W.; Vriend, G.; Sander, C.; Abola, E. E., Errors in protein structures. Nature 1996, 381 (6580), 272.

30. Krieger, E.; Joo, K.; Lee, J.; Raman, S.; Thompson, J.; Tyka, M.; Baker, D.; Karplus, K., Improving physical realism, stereochemistry, and side-chain accuracy in homology modeling: Four approaches that performed well in CASP8. Proteins 2009, 77 Suppl 9, 114-22.

31. Ko, J.; Murga, L. F.; Wei, Y.; Ondrechen, M. J., Prediction of active sites for protein structures from computed chemical properties. Bioinformatics 2005, 21 Suppl 1, i258-65.

32. Tong, W.; Wei, Y.; Murga, L. F.; Ondrechen, M. J.; Williams, R. J., Partial order optimum likelihood (POOL): maximum likelihood prediction of protein active site residues using 3D structure and sequence properties. PLoS Comput Biol 2009, 5 (1), e1000266.

33. Gilson, M. K., Multiple-site titration and molecular modeling: two rapid methods for computing energies and forces for ionizable groups in proteins. Proteins 1993, 15 (3), 266-82.

34. Wei, Y.; Ko, J.; Murga, L. F.; Ondrechen, M. J., Selective prediction of interaction sites in protein structures with THEMATICS. BMC Bioinformatics 2007, 8, 119.

35. Bartlett, G. J.; Porter, C. T.; Borkakoti, N.; Thornton, J. M., Analysis of catalytic residues in enzyme active sites. J Mol Biol 2002, 324 (1), 105-21.

36. Sankararaman, S.; Sjolander, K., INTREPID--INformation-theoretic TREe traversal for Protein functional site IDentification. Bioinformatics 2008, 24 (21), 2445-52.

37. Somarowthu, S.; Yang, H.; Hildebrand, D. G.; Ondrechen, M. J., High-performance prediction of functional residues in proteins with machine learning and computed input features. Biopolymers 2011, 95 (6), 390-400.

38. Porter, C. T.; Bartlett, G. J.; Thornton, J. M., The Catalytic Site Atlas: a resource of catalytic sites and residues identified in enzymes using structural data. Nucleic Acids Res 2004, 32 (Database issue), D129-33.

39. Lehninger, A. L.; Nelson, D. L.; Cox, M. M., Lehninger principles of biochemistry. 6th ed.; W.H. Freeman: New York, 2013.

40. Huang, S. Y.; Zou, X., Advances and challenges in protein-ligand docking. Int J Mol Sci 2010, 11 (8), 3016-34.

41. Hernández-Santoyo, A.; Tenorio-Barajas, A. Y.; Altuzar, V.; Vivanco-Cid, H.; Mendoza-Barrera, C., Protein-Protein and Protein-Ligand Docking. 2013.

42. Jones, G.; Willett, P.; Glen, R. C.; Leach, A. R.; Taylor, R., Development and validation of a genetic algorithm for flexible ligand docking. Abstr Pap Am Chem S 1997, 214, 154-COMP.

43. Baxter, C. A.; Murray, C. W.; Clark, D. E.; Westhead, D. R.; Eldridge, M. D., Flexible docking using Tabu search and an empirical estimate of binding affinity. Proteins 1998, 33 (3), 367-82.

44. Eberhart, R.; Kennedy, J., A new optimizer using particle swarm theory. In Proceedings of the Sixth International Symposium on Micro Machine and Human Science (MHS '95), 4-6 Oct 1995; 1995; pp 39-43.

45. Jorgensen, W. L.; Maxwell, D. S.; Tirado-Rives, J., Development and testing of the OPLS all-atom force field on conformational energetics and properties of organic liquids. J Am Chem Soc 1996, 118 (45), 11225-11236.

46. Weiner, P. K.; Kollman, P. A., AMBER: Assisted model building with energy refinement. A general program for modeling molecules and their interactions. J Comput Chem 1981, 2 (3), 287-303.

47. Brooks, B.; Bruccoleri, R.; Olafson, B.; States, D.; Swaminathan, S.; Karplus, M., CHARMM - a program for macromolecular energy, minimization, and dynamics calculations. J Comput Chem 1983, 4, 187-217.

48. Irwin, J. J.; Sterling, T.; Mysinger, M. M.; Bolstad, E. S.; Coleman, R. G., ZINC: a free tool to discover chemistry for biology. J Chem Inf Model 2012, 52 (7), 1757-68.

49. Wang, Y.; Xiao, J.; Suzek, T. O.; Zhang, J.; Wang, J.; Bryant, S. H., PubChem: a public information system for analyzing bioactivities of small molecules. Nucleic Acids Res 2009, 37 (Web Server issue), W623-33.

50. Cousins, K. R., Computer review of ChemDraw Ultra 12.0. J Am Chem Soc 2011, 133 (21), 8388.

51. Onodera, K.; Satou, K.; Hirota, H., Evaluations of molecular docking programs for virtual screening. J Chem Inf Model 2007, 47 (4), 1609-1618.

52. Schuttelkopf, A. W.; van Aalten, D. M., PRODRG: a tool for high-throughput crystallography of protein-ligand complexes. Acta Crystallogr D Biol Crystallogr 2004, 60 (Pt 8), 1355-63.

53. Morris, G. M.; Goodsell, D. S.; Halliday, R. S.; Huey, R.; Hart, W. E.; Belew, R. K.; Olson, A. J., Automated docking using a Lamarckian genetic algorithm and an empirical binding free energy function. Journal of Computational Chemistry 1998, 19 (14), 1639-1662.

54. Morris, G. M.; Huey, R.; Lindstrom, W.; Sanner, M. F.; Belew, R. K.; Goodsell, D. S.; Olson, A. J., AutoDock4 and AutoDockTools4: Automated Docking with Selective Receptor Flexibility. Journal of Computational Chemistry 2009, 30 (16), 2785-2791.

55. Duan, Y.; Wu, C.; Chowdhury, S.; Lee, M. C.; Xiong, G.; Zhang, W.; Yang, R.; Cieplak, P.; Luo, R.; Lee, T.; Caldwell, J.; Wang, J.; Kollman, P., A point-charge force field for molecular mechanics simulations of proteins based on condensed-phase quantum mechanical calculations. J Comput Chem 2003, 24 (16), 1999-2012.

56. Friesner, R. A.; Banks, J. L.; Murphy, R. B.; Halgren, T. A.; Klicic, J. J.; Mainz, D. T.; Repasky, M. P.; Knoll, E. H.; Shelley, M.; Perry, J. K.; Shaw, D. E.; Francis, P.; Shenkin, P. S., Glide: a new approach for rapid, accurate docking and scoring. 1. Method and assessment of docking accuracy. J Med Chem 2004, 47 (7), 1739-49.

57. Halgren, T. A.; Murphy, R. B.; Friesner, R. A.; Beard, H. S.; Frye, L. L.; Pollard, W. T.; Banks, J. L., Glide: a new approach for rapid, accurate docking and scoring. 2. Enrichment factors in database screening. J Med Chem 2004, 47 (7), 1750-9.

58. Banks, J. L.; Beard, H. S.; Cao, Y.; Cho, A. E.; Damm, W.; Farid, R.; Felts, A. K.; Halgren, T. A.; Mainz, D. T.; Maple, J. R.; Murphy, R.; Philipp, D. M.; Repasky, M. P.; Zhang, L. Y.; Berne, B. J.; Friesner, R. A.; Gallicchio, E.; Levy, R. M., Integrated Modeling Program, Applied Chemical Theory (IMPACT). J Comput Chem 2005, 26 (16), 1752-80.

Chapter 2

Designing small molecules to target the human A2A adenosine receptor

Parts of this chapter have been published in Bioorganic & Medicinal Chemistry:

Rhiannon Thomas, Joslynn Lee, Vincent Chevalier, Sara Sadler, Kaisa Selesniemi, Stephen Hatfield, Michail Sitkovsky, Mary Jo Ondrechen and Graham B. Jones. (2013) Design and evaluation of xanthine based adenosine receptor antagonists: Potential hypoxia targeted immunotherapies. Bioorganic & Medicinal Chemistry, 21, 7453-7464.

2.1 Introduction to guanine nucleotide-binding protein (G-protein) coupled receptors (GPCR)

Cells contain proteins, called transmembrane receptors, which bind molecules and perform signal transduction into the cell, resulting in a physiological response. These transmembrane receptors have three main components: an extracellular ligand-binding domain, a hydrophobic membrane-spanning region, and an intracellular domain. The guanine nucleotide-binding protein (G-protein) coupled receptors (GPCRs) are the largest family of transmembrane receptors. This family is characterized by a seven-helix transmembrane topology.1,2 GPCRs detect a wide spectrum of extracellular signals, including hormones, neurotransmitters, small organic molecules and sensory stimuli.3 GPCRs make up the largest group of targets for drugs on the market, such as antihypertensive and anti-allergic drugs.4 GPCR activation involves ligand binding on the extracellular region of the receptor, which induces a conformational change that alters the position of the seven transmembrane helices, thereby allowing the receptor to interact with heterotrimeric G-proteins.5 G-proteins are specialized proteins that bind the nucleotides guanosine triphosphate (GTP) and guanosine diphosphate (GDP). The G-proteins have three different subunits, alpha, beta and gamma, which are encoded by different genes. The combination of these subunits will relay different messages for downstream signaling (Figure 2.1).3 The alpha and gamma subunits are attached to the plasma membrane through lipid anchors.6 The GPCR acts as a “guanine nucleotide exchange factor” that catalyzes the exchange of GDP (inactive) for GTP (active) on the alpha subunit when an agonist binds.5 To facilitate specific downstream signaling, multiple subtypes of the alpha subunit (e.g. Gi, Gs, Golf) are required. Various conformational changes can stimulate different G-protein-dependent and -independent pathways for a biological response.1 One well-studied class of GPCRs is the adenosine receptors, as they are promising therapeutic targets in cerebral and cardiac ischemic disease, sleep disorders, immune and inflammatory disorders and cancer.7

Figure 2.1 Cartoon sketch of GPCR activation by a bound agonist, showing the exchange of GDP for GTP. On the left (in blue) is the inactive state of the GPCR and on the right (in purple) is the active state of the GPCR with an agonist bound. The inactive state contains a GDP attached to the heterotrimeric complex. When an agonist binds, the conformation of the GPCR activates the G-protein, with GTP replacing GDP. The heterotrimeric G-protein dissociates into a GTP-bound alpha subunit and a beta-gamma complex. Depending on the subtype of the alpha subunit, a specific biological signal is transmitted.6

2.2 Types of Adenosine Receptors

The purine nucleoside adenosine is a ubiquitous signaling molecule and regulator of tissue function.8 Adenosine is not primarily released in a transmitter- or hormone-like fashion, but is released as a response when cells form groups.9 Adenosine is released through an equilibrative transporter or as a result of cell damage.7 The concentration of adenosine in normal cells is ~300 nM, but the concentration increases up to 1200 nM in inflamed or damaged tissues.10 This indicates that adenosine increases under critical conditions in certain tissues throughout the body. Adenosine regulation is mediated by interaction with four adenosine receptor subtypes, A1, A2A, A2B and A3.

Each adenosine receptor subtype has a unique ligand binding profile, activation profile, subcellular localization and G-protein binding preference.11 These receptor subtypes are coupled to the cyclic adenosine monophosphate (cAMP) second messenger system.12

The receptors differ in affinity for the natural agonist, adenosine, with the rank order of potency being A1 ≥ A2A >> A3 ≈ A2B.13 Thus, the A1 and A2A subtypes are labeled as high affinity receptors and the A2B and A3 subtypes as low affinity receptors. Subtype activation leads to downstream signaling pathways that depend on the type of G-protein (Figure 2.2). Activation of A1 receptors coupled to Gi proteins, which belong to the pertussis toxin-sensitive family, inhibits adenylate cyclase (AC), while Go modulates calcium channels (Ca2+), potassium channels (K+) and phospholipase C (PLC). A3 receptors interact with Gq and Gi proteins; Gq stimulates PLC and Gi inhibits adenylate cyclase. A2A adenosine receptors (A2AAR) couple to Gs proteins to activate adenylate cyclase, which increases the intracellular cyclic adenosine monophosphate (cAMP) concentration. A2B receptors couple with Gs/Gq proteins to stimulate adenylate cyclase and PLC.7,14

Figure 2.2 Affinity of adenosine to adenosine receptors. Adenosine will bind (via blue arrows) to one of the four A1, A2A, A2B, or A3 subtypes to inhibit or stimulate downstream responses.

Activation of A1 receptors inhibits adenylate cyclase (AC) while modulating calcium channels (Ca2+), potassium channels (K+) and phospholipase C (PLC). A3 receptors stimulate PLC and inhibit adenylate cyclase (AC). A2AARs activate adenylate cyclase, which increases cAMP concentration. A2B adenosine receptors stimulate adenylate cyclase and PLC.

The receptor subtypes control the concentration of intracellular cAMP in the cells which is important for signal transduction and regulation. Adenosine receptors are important as drug targets for various diseases due to their wide distribution in the body, shown in Table 2.1. To develop quality drug targets, understanding each subtype activation is necessary. A2AARs are implicated in several diseases and offer the potential to impact neurodegenerative disorders, cardiac regulation, inflammation and cancer.7

Table 2.1 The tissue distribution, modulated systems and disease targets for each subtype.

Abbreviations: adenylate cyclase (AC), potassium channels (K+) and phospholipase C (PLC).7,15

High affinity adenosine receptors

A1
Effector systems: AC ↓, K+ channel ↑, PLC ↑↓
Tissue distribution: Brain, heart, adipose tissue, stomach, vas deferens, testis, spleen, kidney, aorta, eye and bladder
Disease target: Cognitive disease, edema, pain, stroke, epilepsy, migraine, cardiac ischaemia, arrhythmia

A2A
Effector systems: AC ↑
Tissue distribution: Striatum, nucleus accumbens, olfactory tubercle, immune cells, heart, lung and blood vessels
Disease target: Neurodegeneration, Parkinson's Disease, Huntington's disease, migraine, sleep disorder, respiratory disorder, reperfusion injury, thrombosis, hypertension, kidney ischaemia

Low affinity adenosine receptors

A2B
Effector systems: AC ↑, PLC ↑
Tissue distribution: Low levels in all tissues
Disease target: Asthma, diabetes, diarrhea

A3
Effector systems: AC ↓, PLC ↓
Tissue distribution: Testis, lung, kidney, placenta, heart, brain, spleen, liver, uterus, bladder, jejunum, aorta, proximal colon and eyes
Disease target: Glaucoma, renal failure, stroke, lung injury, cardiac ischemia, cancer

2.3 Implications of A2A adenosine receptor in cancer

Understanding inflammation and tissue damage for tumor protection

Uncontrolled inflammation plays an important role in the pathogenesis of major diseases including cancer, heart disease, atherosclerosis and sepsis.16 Understanding the triggers of inflammation will give insight in how to combat these diseases and further how cancer arises from inflammation of tissues. Certain subtypes of adenosine receptors are involved in the signaling for protection of the tissues. Specifically, high affinity A2A and low affinity A2B subtypes are found on human T lymphocyte cells.

Background of cell-mediated immunity

Lymphocytes are part of the cell-mediated immune response; these cells recognize and respond to foreign antigens, which in turn helps destroy or directly kill the infected cells. The types of T cells include helper T cells, cytotoxic T cells, regulatory T cells and natural killer (NK) cells.

Helper T cells secrete cytokines, small proteins that act as messengers and regulate and mediate other T cells to perform effector functions, such as proliferation, differentiation or activating other cells with cytokines.17 To mediate responses, T cells contain a T-cell receptor (TCR) on the cell surface which recognizes antigens to facilitate cell-mediated immunity. The two types of TCRs expressed are CD8+, found on cytotoxic T cells which destroy cells, and CD4+, found on helper T cells that activate cytotoxic cells and macrophages. Trauma or infection to healthy cells causes a response of the immune system, and T cells are the first line of defense by secreting proinflammatory cytokines and cytotoxic molecules.16 Unfortunately, this response causes collateral tissue damage through the destruction of infected or injured cells. Adenosine and adenosine receptors are involved in the regulation of normal immune responses, which makes them attractive as molecular targets for immunomodulation.18 Adenosine plays a role in the regulation of inflammation and tissue protection from excessive damage.16

The delayed negative feedback downregulation mechanism shows the process of the activation of immune cells (Figure 2.3). Damaged tissue (step 1) results in local tissue hypoxia (step 2) and the accumulation of extracellular adenosine (step 3). The adenosine binds to A2AARs and triggers the production of cAMP (step 4). The increased concentration of cAMP stops intracellular signaling pathways that use proinflammatory molecules such as tumor necrosis factor alpha (TNF-α), interleukin-2 (IL-2) and interferon gamma (IFN-γ). Because this termination of the immune response by A2AARs is delayed, immune cells can destroy pathogens before they are shut off by this mechanism. Whether activation of A2AAR prevents tissue damage was tested in vivo by Ohta et al.2

Figure 2.3 Overview of delayed negative feedback downregulation in inflamed local tissue environment.16

Using mice deficient in the A2A receptor gene (Adora2a-/-) and models of immune-mediated tissue damage, Ohta et al.2 showed that the absence of A2AARs increased inflammation and caused more tissue damage. Concanavalin A (ConA) induces hepatic injury in mice mediated by T cells, macrophages, and cytokines, specifically TNF-α and IFN-γ.2,19 Administration of the selective A2AAR agonist CGS21680 prevented liver damage and the accumulation of the proinflammatory molecule TNF-α. In mice lacking A2AARs or not given CGS21680, tissue damage continued and two of the four mice died.2 The accumulated adenosine was unable to produce intracellular cAMP to downregulate A2AAR processes.16

In tumor microenvironments, an overaccumulation of adenosine is produced through the hypoxia-tissue-protecting mechanism that activates A2AARs on the surface of surrounding cells, which leads to the protection of cancerous tissues.20 Ohta et al.20 demonstrated the tumor-protecting mechanism by using genetic deletion of A2AAR and competitive antagonists to survey tumor growth in mice (Figure 2.4). Genetic elimination of A2AAR resulted in tumor rejection and tumor growth retardation due to the inhibition of the adenosine-mediated pathway (Figure 2.4B).17 Using two tumor models, CL8-1 melanoma and RMA T lymphoma, in C57BL/6-background A2AR gene-deficient (A2AR-/-) mice, tumor growth was measured and compared to wild-type controls. The mice with deleted A2AAR rejected the CL8-1 melanoma and survived, whereas the wild-type controls died. In the RMA T lymphoma model, the gene-deficient mice also survived, which was not seen in the wild-type mice. Ohta et al.20 then investigated the inhibition of A2AAR using the antagonists ZM241385 and caffeine on the endogenously developed antitumor cells, CD8+ T cells. The results showed that T cells treated with antagonists delayed the growth of CL8-1 melanoma tumors. Rejection of tumors was not seen, which may be due to the short half-life of the antagonists.

Figure 2.4 Activation or deactivation of A2A signaling by adenosine, antagonists or genetic deletion of A2AARs. In panel A, when adenosine binds to the A2AAR, cAMP is produced and signals the TCR to inhibit overactive immune cells to protect normal tissues. In hypoxic conditions where adenosine accumulates, the A2A signaling will not inhibit TCR which in turn will protect malignant tissues. Panel B shows that genetic targeting (deletion) of A2AAR will stop production of intracellular cAMP and not signal to the TCR. Panel C utilizes antagonists to prevent cAMP production and stop activated T cells.20

Ohta et al.20 showed that extracellular adenosine levels are elevated in solid tumors, with higher levels in the center of the tumor and lower levels at its edge. These results underline the importance of targeting A2AAR in hypoxic tumors.

Solid tumors and hypoxia

The hypoxia-adenosinergic tissue-protecting mechanism is triggered by 1) inflammatory damage to blood vessels, 2) interruption in oxygen supply, 3) low oxygen tension or 4) the hypoxia-driven accumulation of extracellular adenosine that signals the production of cAMP by A2AAR.21

A characteristic feature of advanced solid tumors is the formation of hypoxic tissue areas, which occurs in cancers of the breast, uterine cervix, vulva, head & neck, prostate, rectum, pancreas, lung, brain tumors, soft tissue sarcomas, non-Hodgkin’s lymphomas, malignant melanomas, metastatic liver tumors, and renal cell cancer.22 Hypoxia is an imbalance between the supply and consumption of oxygen. In tumors, the cellular O2 consumption rate outweighs the O2 supply, which in turn leaves portions of the tumor with very low O2 levels.22,23 Blood vessels deliver oxygen and nutrients throughout the body. In tumors, angiogenesis is stimulated and the vasculature becomes abnormal in structure and function. This rapid growth of new vasculature creates a tumor microenvironment with hypoxic regions, low pH and interstitial fluid pressure.24

Figure 2.5 Comparison of healthy tissue to a tumor. In healthy tissue, normal vasculature structure exists with few areas of hypoxia (shaded gray). In the tumor, hypoxic areas are more abundant (shaded gray) and the vasculature is abnormal. The blood vessels (in purple) designate low oxygen present.24

The types of hypoxia are perfusion-related (acute), diffusion-related (chronic) and anemia-related. Perfusion-related hypoxia is caused by limited blood flow in tissues due to structural and functional abnormalities of the tumor vasculature (Figure 2.5), such as incomplete endothelial lining, branch irregularity, tangles and uneven vessel diameters.23 Diffusion-related hypoxia is caused by an increase in diffusion distances with tumor expansion, because cells far away (70 μm) from a functional blood vessel receive less oxygen and nutrients.22 Lastly, anemia-related hypoxia is caused by a reduced O2 transport capacity of the blood due to low levels of hemoglobin.23

Hypoxia influences tumor cells in two ways: as a stressor that impairs growth or causes programmed cell death (apoptosis), and as a factor that promotes malignant progression or induces genetic changes that allow the cells to survive in the environment.23 In malignant progression, tumors develop an increased potential for invasive growth, perifocal tumor cell spreading, and regional and distant tumor cell metastasis, which lead to a poor prognosis.22

Hypoxia is associated with resistance to radiation therapy and chemotherapy. Radiation treatment requires free radicals from oxygen to destroy cells, so cells in hypoxic areas cannot be targeted.25 Chemotherapy requires cytotoxic drugs to be delivered by proper blood flow, but leaky vessels and low blood flow prevent drug delivery.24 Efforts are needed to identify solid tumors undergoing hypoxia, which can then be removed before malignant progression. Many antagonists have a short half-life in vivo; Ohta et al.20 reported a circulating half-life of 30-50 minutes for ZM241385. New chemotherapy treatments need an improved therapeutic window, and understanding the binding interactions and structure of their protein target is essential to achieving it.

2.4 Known small molecule therapies targeting the A2A adenosine receptor

The development of adenosine receptor ligands for specific clinical applications is challenging due to the intricate signaling by adenosine in the body. The natural agonist of the A2AAR is adenosine, upon which modifications have been explored at the N6 or 2-position and at the 5′-position of the ribose (Figure 2.6). At the 2-position, thioethers, secondary amines and alkyne groups result in selective analogues.26 The addition of a 5′-N-alkyluronamide to the N2 position generated the potent nonselective agonist 5′-N-ethyluronamide (2.3 NECA). Further molecules (Figure 2.7) were generated after the success of modification at the N2 position. The agonist 2.5 CGS 21680 added a 2-(2-phenylethyl)amino group that extended a chain for derivatization without losing high-affinity binding to the receptor.26 Two other compounds, 2.4 UK-432097 and 2.6 Regadenoson, are modified at the N2 position and also at the N6 position to increase the affinity for A2AARs.

Figure 2.6 The natural agonist adenosine (A) and the non-selective antagonist molecule caffeine (B).26

Regadenoson (Lexiscan; Astellas Pharma) is approved by the US Food and Drug Administration (FDA) for myocardial perfusion imaging in patients with coronary artery disease.27 Table 2.2 shows the inhibition constants of the agonists for each receptor subtype. NECA has high affinity, while CGS 21680 is more selective.

A2AAR antagonists lack the sugar moiety and in general possess a mono-, bi-, or tricyclic structure that mimics the adenine part of adenosine.28 Caffeine (2.7 1,3,7-trimethylxanthine) is a methylxanthine and a natural antagonist for A2AAR; it demonstrates antagonism in the micromolar (μM) range and is nonselective. Both xanthine and nonxanthine classes of molecules have been investigated as selective antagonists. Modifications (Figure 2.6) to the xanthine molecule of caffeine have been explored at the 1-, 3-, 5-, 8-, and 7-positions to find a selective A2AAR antagonist.29 The A2AAR tolerates smaller alkyl substituents, i.e. methyl, allyl, propargyl, ethyl, propyl and cyclopentyl, at the N1 position.30 At the N3 position, small alkyl chains, methyl, ethyl, propyl, 3-hydroxypropyl and phenoxy groups can be tolerated.30

CSC (2.10 8-(m-chlorophenylbutadienyl)caffeine) and KW-6002 (2.11 Istradefylline), both styrylxanthine derivatives, are two of the best A2AAR antagonists to date. Substitution at C8 improves both selectivity and potency (Table 2.2), which was seen with CSC, upon substitution of a meta-chlorostyryl group, and with KW-6002, upon addition of a styryl phenyl group and meta/para-methoxy substituents.7,31,32 KW-6002 is in Phase III clinical trials as a once-daily oral treatment of Parkinson’s disease (PD).15 Unfortunately, these derivatives have high lipophilicity and low water solubility.29 The E-configuration isomers of styrylxanthines easily undergo photoisomerization in dilute solutions (this does not occur in concentrated solution), resulting in mixtures of E and Z isomers.33 As a solid, styrylxanthines can undergo photo-dimerization ([2 + 2] cycloaddition reaction).34

Table 2.2 Functional assay data for agonists and antagonists. Binding experiments are from human A1, A2A, A2B and A3 adenosine receptors, unless noted with ‡ or ¶. N.D. is not determined or not disclosed.7, 15, 26, 35

Pharmacology: inhibition constant, Ki (nM)

Ligand                            A1        A2A      A2B        A3        Binding experiments   Reference

Agonists
Adenosine                         77        0.5      N.D.       45        functional studies    Chen et al., 2013
5′-N-ethyluronamide (NECA)        14        20       140¶       25        functional studies    Jacobson et al., 2006
CGS21680                          289       27       >10,000¶   N.D.      functional studies    Jacobson et al., 2006
UK-432097                         N.D.      4        N.D.       N.D.      functional studies    Mantell et al., 2009
Regadenoson                       >10,000   290      >10,000    >10,000   functional studies    Chen et al., 2013

Antagonists
Caffeine                          10,700    23,400   33,800     13,300    functional studies    Chen et al., 2013
Theophylline                      6,770     1,710    9,070      22,300    functional studies    Muller et al., 2011
ZM241385                          774       1.6      75         743       functional studies    Jacobson et al., 2006
CSC                               28,000‡   54‡      N.D.       N.D.      functional studies    Jacobson et al., 2006
Istradefylline (KW6002)           841       12       >10,000    4,470     functional studies    Chen et al., 2013
Preladenant                       >1,000    0.9      >1,000     >1,000    functional studies    Chen et al., 2013
SYN-115                           228.4     0.38     N.D.       N.D.      functional studies    Chen et al., 2013

¶ Data are from a cyclic AMP functional assay. ‡ Binding experiments are with rat A2AARs.
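The selectivity comparison drawn from Table 2.2 can be made explicit with a small calculation. The sketch below is illustrative only: it copies the A1 and A2A Ki values from the table for ligands that have numeric human data for both subtypes, and it assumes that selectivity is expressed as the simple ratio Ki(A1)/Ki(A2A). The dictionary and function names are hypothetical, and the ratio ignores differences between assay types.

```python
# Ki values (nM) copied from Table 2.2. CSC is omitted because its values
# are from rat receptors; ligands with N.D. or ">" entries are also omitted.
KI_NM = {
    "NECA": {"A1": 14.0, "A2A": 20.0},
    "CGS21680": {"A1": 289.0, "A2A": 27.0},
    "Caffeine": {"A1": 10700.0, "A2A": 23400.0},
    "Theophylline": {"A1": 6770.0, "A2A": 1710.0},
    "ZM241385": {"A1": 774.0, "A2A": 1.6},
    "Istradefylline (KW-6002)": {"A1": 841.0, "A2A": 12.0},
    "SYN-115": {"A1": 228.4, "A2A": 0.38},
}


def a2a_selectivity(ki):
    """Fold selectivity for A2A over A1, assumed here to be Ki(A1) / Ki(A2A).

    Values above 1 mean the ligand binds A2A more tightly than A1.
    """
    return ki["A1"] / ki["A2A"]


SELECTIVITY = {name: a2a_selectivity(ki) for name, ki in KI_NM.items()}

if __name__ == "__main__":
    # Print ligands from most to least A2A-selective.
    ranked = sorted(SELECTIVITY.items(), key=lambda kv: kv[1], reverse=True)
    for name, fold in ranked:
        print(f"{name:26s} {fold:8.1f}-fold")
```

Under this definition NECA comes out slightly below 1 (no A2A preference) while CGS21680 is roughly tenfold selective, in line with the statement that NECA has high affinity and CGS 21680 is the more selective agonist.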

Figure 2.7 Structures of known agonists targeting the A2AARs.

Figure 2.8 Structures of known antagonists targeting the A2AARs.

Attempts to overcome this issue by replacing the vinylene linker with alkynyl, cyclopropyl and 2-naphthyl links resulted in a loss of affinity. The substitution of heterocyclic ring systems in place of the xanthine core has led to high affinity non-xanthine antagonists (Figure 2.8): the triazolotriazine 2.9 ZM241385, the pyrazolotriazolopyrimidine 2.12 preladenant and the benzothiazole derivative 2.13 SYN-115.26 Preladenant is undergoing clinical trials for the treatment of Parkinson’s Disease.

Using molecular modeling techniques and past literature on the xanthine class of compounds, we plan to design and chemically synthesize de novo functional analogs of KW-6002 which overcome limitations as a lead compound for hypoxic tumors.

2.5 Proposal to design a small molecule to target the A2A adenosine receptor

The Jones synthetic chemistry group at Northeastern University studies 8-styrylxanthines and investigates potential immunotherapies using KW-6002. The compound shows promising data in immunotherapy models for Parkinson’s Disease.36 This molecule crosses the blood-brain barrier (BBB), which is desirable for diseases of the central nervous system (CNS) but undesirable for non-CNS diseases. KW-6002 is highly selective for the A2AAR but, unfortunately, is not water soluble. Modifications of this compound can be made to obtain more favorable physicochemical properties. Using molecular modeling techniques, the analogues can be computationally predicted and optimized to guide the synthesis of compounds. The goals of this project are to produce a library of synthetically feasible molecules and to further improve physicochemical properties, i.e. reduce photoisomerization, increase water solubility and decrease penetration of the BBB. Previous research has demonstrated the success of using poly(ethylene glycol) (PEG) to enhance the hydrophilicity and increase the aqueous solubility and circulating plasma half-life of molecules.37,38 PEG is not toxic and has been demonstrated to be a practical carrier for pharmaceuticals.39 PEG conjugation is expected to increase the molecular weight, which will in turn decrease the ability to cross the BBB. To create a PEG conjugate (Figure 2.9), the target receptor and its active site need to be further investigated. A co-crystallized structure of KW-6002 and the A2AAR is not available. There are four crystal structures of the human A2AAR available in the PDB, each with a different co-crystallized ligand: three are agonists and one is an antagonist.

Figure 2.9 Proposed design of bioconjugate. A) The starting point is to use a xanthine with the addition of potential substituents at R1 and R2. The R2 indicates the PEG linker substituent via a thiol or hydroxyl group. N is the number of PEG chains. B) The bioconjugate linked by a PEG chain to an imaging agent, yellow.

2.6 Understanding the structure and function of the human A2A adenosine receptor to design small molecules

The overall topological structure of GPCRs includes seven transmembrane helices, an extracellular region for a ligand to bind and an intracellular region for G-protein interactions (Figure 2.10). The A2AAR consists of seven transmembrane helices (TM) with a short helix 8. The extracellular region contains three extracellular loops (ECL 1-3) and the intracellular region contains three intracellular loops (ICL 1-3). In 2008, the first crystal structure of an A2AAR (PDB 3eml) was solved with the co-crystallized antagonist 4-(2-[7-amino-2-(2-furyl)-[1,2,4]triazolo-[2,3-a][1,3,5]triazin-5-ylamino]ethyl)-phenol (ZM241385).12 Membrane-bound proteins are difficult to crystallize due to thermal instability. The 3eml structure used the T4-lysozyme (T4L) fusion strategy. The second extracellular loop (ECL2) assumes a random coil structure due to the lack of electron density for residues Gln148 - Ser156.11 The structure reveals the types of binding interactions occurring with the co-crystallized ligand.

Studying the binding cavity of the A2AAR will yield information about agonist and antagonist binding. The important residues for ligand recognition are in Table 2.3 and in Figure 2.11.40 The agonist residues are listed as ligand recognition for adenosine. The antagonist residues have been studied by mutagenesis studies for the ZM241385 molecule. This information will be important when building a docking validation model. Phe168, Asn181, Trp246, Leu249, and Ile274 are residues unique to antagonist recognition. Phe168 is the most important residue for adenosine agonist and antagonist high affinity binding in the active site because it has π-stacking interactions with the bicyclic triazolotriazine core.

Figure 2.10 Cartoon view of the 3eml structure. A) Representation of the A2AAR (PDB 3eml). The extracellular region contains three extracellular loops (ECL); the dashed line represents missing residues. The intracellular loops (ICL) lie near the cytoplasmic side (gray). The seven transmembrane helices are TM1 (red), TM2 (orange), TM3 (yellow), TM4 (green), TM5 (blue), TM6 (purple) and TM7 (magenta). Shown in ball-and-stick are lipids (white), sulfate ions (yellow) and the co-crystallized ligand ZM241385. B) 2D structure of the co-crystallized ligand ZM241385 with noted substituents. Image rendered in CHIMERA.41

Table 2.3 Overview of interacting residues from previous literature. This table will aid in the selection of important residues in the docking procedure.

Reported literature residues

Residue type   Residue number   Binding              Reference
Glu            13               ligand recognition   Gao et al, 2000
Ile            80               ligand recognition
Val            84               ligand recognition   Jiang et al, 1997
Leu            85               antagonist           Ivanov et al, 2009
Thr            88               ligand recognition   Jiang et al, 1997
Gln            89               ligand recognition   Jiang et al, 1997
Ile            135              ligand recognition   Ivanov et al, 2009
Leu            167              ligand recognition   Ivanov et al, 2009
Phe            168              antagonist           Ivanov et al, 2009
Asn            181              antagonist           Kim et al, 1995
Phe            182              ligand recognition   Ivanov et al, 2009
Val            186              ligand recognition   Ivanov et al, 2009
Trp            246              antagonist           Ivanov et al, 2009
Leu            249              antagonist           Ivanov et al, 2009
His            250              ligand recognition   Jiang et al, 1997/Kim et al, 1995
Ile            274              antagonist           Ivanov et al, 2009/Kim et al, 1995
Ser            277              ligand recognition   Jiang et al, 1997
His            278              ligand recognition

The other residues shown in Figure 2.8, Ile80, Val84, Thr88, Gln89, Ile135, Leu167, Phe182, Val186, His250, Ser277, and His278, are not specific to antagonist binding, but are also involved in the recognition of agonists and endogenous ligands.

Figure 2.11 Close-up view of active site residues for PDB 3eml. Important active site residues (in gray) with co-crystallized ligand ZM241385 (in black) for PDB 3eml. The first important interaction of ZM241385 is the aromatic stacking of Phe168. The second is a hydrogen bond interaction of Asn253 to N15 atom of ZM241385. Image rendered in CHIMERA.41

After the PDB 3eml structure, two more structures were determined in 2011 by Lebon et al.42 These structures were co-crystallized with adenosine (PDB 2ydo) and NECA (PDB 2ydv). Adenosine and NECA bind in identical poses (Figure 2.12). The adenine ring in the agonists has interactions similar to those of ZM241385: N6 forms hydrogen bonds with Glu169 and Asn253, and the adenine ring has a π-stacking interaction with Phe168. One difference between ZM241385 and adenosine/NECA in their binding to the receptor is the presence of the furan substituent (Figure 2.9B); this furan ring forms a hydrogen bond with Asn253. The ribose moiety of adenosine/NECA forms hydrogen bonds with Ser277 and His278, so the difference in binding lies at residues Ser277 and His278. Understanding these interactions gives more information on how to develop the docking validation. Similar molecules with an adenine core will need to exhibit the proper pose orientation.

Figure 2.12 Structurally aligned active sites of 3eml, 2ydo and 2ydv. A) The crystal structures of 3eml (in tan), 2ydo with adenosine (in cyan) and 2ydv with NECA (in dark cyan) show a shift in the residues Phe168, His278 and Glu169. Val84, Leu85 and Thr88 have van der Waals interactions at the bottom of the active site. B) The active sites of 3eml with ZM241385 (in black), 2ydo with adenosine (in cyan) and 2ydv with NECA (in dark cyan) show similar poses of the adenine ring, with one difference in the furan ring of ZM241385. Image rendered in CHIMERA.41

To understand the mechanism of ligand-induced GPCR activation, Xu et al.43 reported the second crystal structure (PDB 3qak) of an A2AAR, solved with the co-crystallized agonist 2-(3-[1-(pyridin-2-yl)piperidin-4-yl]ureido)ethyl-6-N-(2,2-diphenylethyl)-5′-N-ethylcarboxamidoadenosine-2-carboxamide (UK-432097). This structure was published just before the adenosine and NECA co-crystallized structures. UK-432097 is a potent and selective agonist, similar to 5′-N-ethylcarboxamidoadenosine (NECA)44 and CGS 21680. UK-432097 was developed by Pfizer as a drug candidate for chronic obstructive pulmonary disease (COPD) treatment. Comparing the two structures, 3eml and 3qak (Figure 2.13), and the difference in binding of UK-432097, which is twice the size of ZM241385, gives new information about how conformational changes in the active site trigger the receptor.43 The bicyclic adenine core of UK-432097 and the bicyclic triazolotriazine (BCT) core of ZM241385 align when the structures are superimposed. These regions interact with Phe168 of ECL2 through aromatic stacking and with Ile274 through a nonpolar interaction. One key difference between UK-432097 and ZM241385 is the ribose moiety, which is commonly seen in agonists. This region occupies the bottom of the cavity. Differences between the two structures are seen in TM1, TM5, TM6 and TM7, where the helices shift by 1.8 - 2 Å. TM6 and TM7 of the 3eml structure (antagonist-bound) shift outward compared to the 3qak structure (agonist-bound). This can be explained by the smaller ZM241385 ligand not requiring an expanded area for binding compared to UK-432097. Movements in TM5 and TM6, together with the “toggle switch” Trp246, are shown to be essential for G-protein binding and activation.45 Investigating the types of ligand-receptor interactions, i.e. hydrogen bonding, nonpolar and aromatic stacking, describes the ligand’s selectivity and stability in the active site.

Figure 2.13 Superposition of the crystal structures of the human A2AARs, 3eml (blue) co-crystallized with ZM241385 (magenta) and 3qak (tan) co-crystallized with UK-432097 (green). All atoms of amino acid residues were superimposed with an RMSD of 1.17 Å. A) Side views of transmembrane helices. B) Extracellular top view of transmembrane helices. Image rendered in CHIMERA.41

UK-432097 forms 11 hydrogen bonds, one aromatic stacking interaction and many nonpolar interactions, while ZM241385 also forms one aromatic stacking interaction but fewer hydrogen bonds (2) and fewer nonpolar interactions (5).12,43 These two crystal structures were used to develop a docking validation that would be used to test a library of compounds. When the initial studies for this project began (2009), the only structure available was 3eml.

2.7 Using molecular modeling tools for docking validation

Docking predicts the most probable binding pose and the likely interactions between a ligand and active site residues in the absence of a solved crystal structure of the ligand-protein complex.

Available experimental information from previously solved crystal structures and mutagenesis studies aids in understanding the interactions occurring in the active site. The present approach also uses the POOL method46 to assist in the identification of residues in or near the binding cavity that may participate in ligand binding. Docking studies were performed using two programs: Yet Another Scientific Artificial Reality Application (YASARA) Structure, in which a derivative of AutoDock 4.047 is embedded, and Schrödinger’s GLIDE48. This section discusses the theoretical methods and outcomes from the docking programs. The last tool, homology modeling, builds in a missing region of both crystal structures, improving the ligand pose in the active site.

2.7.1 POOL predictions give insight for previously unidentified binding residues

The POOL method, with input features from ConCavity and THEMATICS, was performed on the 3eml and 3qak crystal structures to predict the binding residues. The predicted residues were used to identify the boundary for the simulation cell, required for docking in both AutoDock (in YASARA) and GLIDE (Schrödinger). POOL generates a rank order of residues. The top 10% of residues from the transmembrane region were selected (Table 2.4); the T4-lysozyme region of the structure was neglected for these calculations. The T4-lysozyme region is essential for the crystallization process of the transmembrane protein. From the list (Table 2.4), the residues in blue were used to analyze docking results and to identify key interactions (hydrophobic, hydrogen bonding, aromatic stacking) that should be observed with antagonists.
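The selection step described above (take a best-first ranking, ignore the T4-lysozyme insert, keep the top 10%) can be sketched as follows. This is a minimal illustration, not the POOL implementation: the function name, the residue labels and the toy ranking are hypothetical, and it assumes that the T4-lysozyme residues are removed before the 10% cutoff is applied and that the cutoff is rounded up.

```python
def top_percent(ranked_residues, percent=10, exclude=frozenset()):
    """Return the top `percent` of a best-first ranked residue list.

    Residues listed in `exclude` (for example the T4-lysozyme insert) are
    dropped first; the cutoff is then rounded up to a whole residue.
    """
    kept = [res for res in ranked_residues if res not in exclude]
    n_top = (len(kept) * percent + 99) // 100  # integer ceiling
    return kept[:n_top]


# Toy ranking: 30 made-up labels, best-ranked first (not real POOL output).
TOY_RANKING = [f"RES{i}" for i in range(1, 31)]
TOY_T4L = frozenset({"RES2"})  # pretend RES2 belongs to the fusion insert

if __name__ == "__main__":
    print(top_percent(TOY_RANKING, percent=10, exclude=TOY_T4L))
```

Whether the fusion residues are filtered before or after the cutoff changes how many transmembrane residues survive, so the order of the two steps is worth stating explicitly when reporting the selection.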

Table 2.4 The top POOL predictions of binding residues for the PDB structures 3eml and 3qak. The POOL results of the two structures yielded similar residues. In the crystal structures, the residues in blue have significant interactions with their ligands, ZM241385 and UK-432097.

PDB structure 3eml 3qak POOL method tcranks tcranks 9 TYR 13 GLU 13 GLU 52 ASP 59 ALA 59 ALA 60 ILE 60 ILE 61 PRO 62 PHE 62 PHE 63 ALA 63 ALA 64 ILE 81 ALA 82 CYS 84 VAL 84 VAL 85 LEU 85 LEU 88 THR 88 THR 89 GLN 101 ASP 102 ARG 102 ARG 112 TYR Similar residues 168 PHE predicted by 169 GLU 169 GLU POOL method 185 CYS 194 LEU 197 TYR 227 LYS 245 CYS 245 CYS 246 TRP 246 TRP 249 LEU 249 LEU 250 HIS 250 HIS 264 HIS 271 TYR 271 TYR 273 ALA 274 ILE 274 ILE 275 VAL 275 VAL 277 SER 277 SER 278 HIS 278 HIS 279 THR 279 THR 280 ASN 281 SER 288 TYR 288 TYR

The residues used to define the cell boundary for docking studies include Glu13, Thr88, Phe168, Glu169, Trp246, Leu249, His250, His264, Ile274, Ser277 and His278. The residues located at the top of the binding cavity are Glu13 and Glu169. Glu13 on TM1 does not have direct interactions with ZM241385 or UK-432097, but it does interact with TM2, which carries residues that have direct interactions with the co-crystallized ligand. Phe168, Glu169, and Leu249 have interactions with the bicyclic core of ZM241385 and UK-432097. Located at the bottom of the binding cavity are the residues Trp246 and His250, which have hydrophobic interactions with UK-432097. Trp246 is described as the “toggle switch” for antagonists; interaction with this residue prevents downstream signaling by the receptor.12 POOL provided key residues to take into account when analyzing the docking results and when building the simulation cell.

2.7.2 Initial docking validation using YASARA AutoDock

To validate the docking method when a crystal structure of the protein with a bound ligand is available, the co-crystallized ligand should be re-docked into the crystal structure. The resulting pose should align with the original co-crystallized structure. To check the performance of the docking method, the root mean square deviation (RMSD) between the predicted and experimental ligand poses (atom-by-atom) is calculated with MUSTANG.49 An RMSD value less than 2 Å is an optimal performance and 3 Å is considered to be good performance.50

The lowest RMSD may not necessarily have the best ligand binding pose so both the RMSD and pose need to be considered when examining the results. Investigating the interactions and contacts between the ligand and protein is the next step. The pose of the ligand in the active site will explain the likely hydrogen bonding, electrostatic or hydrophobic interactions that contribute to the binding energy of the ligand.
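The RMSD check described above can be illustrated with a short sketch. It is not the MUSTANG calculation used in this work: it assumes the docked and crystal ligands share the same atom order and the same receptor coordinate frame (so no superposition is performed), and the function names, the toy coordinates and the "poor" label for values above 3 Å are hypothetical.

```python
import math


def ligand_rmsd(docked, crystal):
    """Atom-by-atom RMSD (in Å) between two equally ordered (x, y, z) lists."""
    if not docked or len(docked) != len(crystal):
        raise ValueError("poses must be non-empty and have the same atom count")
    squared = sum(
        (a - b) ** 2
        for atom_d, atom_c in zip(docked, crystal)
        for a, b in zip(atom_d, atom_c)
    )
    return math.sqrt(squared / len(docked))


def rate_pose(rmsd):
    """Classify a re-docking result using the 2 Å and 3 Å thresholds."""
    if rmsd < 2.0:
        return "optimal"
    if rmsd <= 3.0:
        return "good"
    return "poor"  # label assumed for illustration


if __name__ == "__main__":
    # Toy two-atom ligand, shifted by 1 Å along z in the docked pose.
    crystal_pose = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    docked_pose = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
    value = ligand_rmsd(docked_pose, crystal_pose)
    print(value, rate_pose(value))
```

As noted in the text, a low RMSD alone does not guarantee a sensible pose, so a classification like this would only be a first filter before inspecting the ligand-protein contacts.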

Initial validation docking studies were performed using AutoDock 4.09 with the default docking parameters supplied (AutoDockLGA and 100 runs) and point charges assigned according to the AMBER03 force field. The setup was done with the YASARA molecular modeling program. The simulation cell size was 20 Å × 40 Å × 20 Å. The ligands were prepared using the GlycoBioChem PRODRG2 Server20 and minimized in YASARA Structure. In the YASARA AutoDock output, the BindEnergy command calculates the binding energy of the selected object with respect to the ‘soup’ according to the current force field, AMBER03. In YASARA AutoDock, the more positive the binding energy, the more favorable the interaction. In other programs, like GLIDE and AutoDock, the reported binding energies are energies of binding and are negative for favorable interactions. To stay consistent with the convention in the field of molecular docking, the scores from YASARA AutoDock are reported here with a negative sign. A simulation cell (Table 2.5) was set up in AutoDock; it was based on the POOL predictions and the experimental information available.
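
The sign convention adopted for the YASARA AutoDock scores amounts to a single sign flip. The sketch below is illustrative only: the function names are assumptions, and the raw value of 8.06 kcal/mol is inferred from the reported score of -8.06 kcal/mol rather than read from a YASARA output file.

```python
def to_docking_convention(yasara_bind_energy):
    """Convert a YASARA binding energy (positive = favorable) to the
    usual docking convention (negative = favorable), in kcal/mol."""
    return -yasara_bind_energy

def more_favorable(score_a, score_b):
    """True if score_a indicates better binding than score_b
    once both follow the negative-is-favorable convention."""
    return score_a < score_b

raw_yasara_value = 8.06  # assumed raw YASARA output for the example
reported_score = to_docking_convention(raw_yasara_value)
print(reported_score)
```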

Table 2.5 The simulation cell parameters for YASARA AutoDock 4.0. Five residues were selected for the PDB 3eml structure. Eight atoms of these residues were selected for their interactions with the co-crystallized ligand and POOL predictions.

Select atoms from residues
Residue Name    Atom 1    Atom 2
Glu169          OE1       OE2
His250          ND1       NE2
Asn253          OD1       ND2
His264          NE2
His278          NE2

2.7.3 Initial invalid docking results in YASARA AutoDock

Initial studies in AutoDock 4.0 showed problems with the docking of ZM241385 into the crystal structure 3eml and of UK-432097 into the crystal structure 3qak. For each ligand, the part of the docked pose at the top of the binding cavity points towards the ECLs, which is not seen in the experimental crystal structures. Figure 2.14 shows the docked ligand ZM241385 (pink) structurally aligned with the co-crystallized ligand from the crystal structure (cyan).

Figure 2.14 The docked ligand ZM241385 in the 3eml crystal structure has an invalid pose. The bicyclic core and furan ring of ZM241385 align well between the crystal structure ligand and the docked ligand, while the 4-hydroxyphenylethyl moiety is rotated in the wrong direction in the docked ligand. Note that residues are missing from the ECL2 loop, which may explain the invalid pose. Image rendered in CHIMERA.41

The RMSD between the crystal structure ligand (ZM241385) and the docked ligand is 1.2862 Å, which is acceptable. For the bicyclic triazolotriazine core and furan ring region of ZM241385, the RMSD is less than 0.764 Å. The deviation of the 4-hydroxyphenylethyl moiety is high, with the O4 atom at 8.391 Å and the C1/C7 atoms above 3.6 Å. The reported docking score for this molecule was -8.06 kcal/mol (the more negative the score, the better the interaction). However, examination of the 3eml structure shows that the second extracellular loop (ECL2) is missing residues Gln148 to Ser156, due to weak experimental electron density in that region.

Docking was also performed on the 3qak structure with UK-432097, and a similar trend was seen at the top of the binding cavity. In Figure 2.15, the phenyl and piperidine rings of UK-432097 sit at the top of the cavity, where they should interact with the ECL2 loop. The RMSD between the crystal structure ligand UK-432097 (gray) and the docked ligand (tan) is 1.2862 Å. For the 3qak structure, residues Gln148 - Gly158 are missing from the crystal structure due to weak electron density. With two docked structures having incorrect poses and the ECL2 loop missing, the decision was made to build in the missing residues because the ECL2 loop has interactions important for positioning disulfide bonds between ECL1 and ECL3.12 The homology modeling module in YASARA Structure11 was used to build the missing loop into the 3eml and 3qak crystal structures. Another docking experiment was then performed on the homology models.

Figure 2.15 Docked UK-432097 compound in the 3qak crystal structure. The bicyclic adenine core and ribose ring are aligned between the crystal structure ligand and docked ligand, but discrepancies occur with the phenyl rings and piperidine rings, which are rotated in the wrong direction in the docked ligand. Image rendered in CHIMERA.41

2.7.4 Optimization of crystal structures: Using homology model to build in missing ECL2 loop

Although the core of each of the two molecules is docked in the correct position, interactions are missing at the top of the binding cavity, and this causes the 4-hydroxyphenylethyl moiety (ZM241385) and the piperidine rings (UK-432097) to be rotated incorrectly in the docked structure. Once the homology models were made, the ligands were redocked into the new models.

Preparing homology models

The protein sequences of the human A2AARs used to build the model structures came from the corresponding FASTA sequence taken from the PDB for each crystal structure. The FASTA sequences include the residues located on the second extracellular loop (ECL2). Specifically, the eight residues Gln148 - Ser156, which were missing due to weak experimental electron density in that region, were built in for 3eml. The ten residues Gln148 - Gly158, which were missing due to weak experimental electron density, were built in for 3qak. Homology modeling was performed with the YASARA suite of programs and the quality of the final model was tested.11

YASARA assigns Z-scores to describe how many standard deviations the model quality is away from the average high-resolution X-ray crystal structure. Negative values indicate that the homology model is not as good as the average high-resolution X-ray structure. For example, a Z-score of -3 is three standard deviations below average. The Z-scores for the model structures for 3eml and 3qak were 0.593 and 0.647, respectively (Table 2.6).

Table 2.6 Z-scores of homology models. The overall values for YASARA homology model building should ideally be near 0 or positive. The model quality is designated by YASARA with a range of labels from ‘optimal’ to ‘good’ to ‘terrible.’ Here the built models for 3eml and 3qak are ‘optimal.’

YASARA data      Model: 3EML                   Model: 3QAK
Check type       Quality Z-score   Comment     Quality Z-score   Comment
Dihedrals        1.342             Optimal     1.098             Optimal
Packing 1D       0.828             Optimal     1.029             Optimal
Packing 3D       0.163             Optimal     0.185             Optimal
Overall          0.593             Optimal     0.647             Optimal
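
As a consistency check on Table 2.6, the Overall value can be recomputed from the three individual Z-scores. The sketch below assumes the weighted average that YASARA's homology modeling report is understood to use (0.145 for dihedrals, 0.390 for 1D packing, 0.465 for 3D packing); these weights are not stated in this chapter and should be treated as an assumption, as should the function and variable names.

```python
# Weights assumed from YASARA's homology modeling report (not stated in this chapter).
WEIGHTS = {"dihedrals": 0.145, "packing_1d": 0.390, "packing_3d": 0.465}

def overall_z(dihedrals, packing_1d, packing_3d):
    """Weighted average of the three quality Z-scores."""
    return (WEIGHTS["dihedrals"] * dihedrals
            + WEIGHTS["packing_1d"] * packing_1d
            + WEIGHTS["packing_3d"] * packing_3d)

# Individual Z-scores from Table 2.6: (dihedrals, packing 1D, packing 3D).
table_2_6 = {
    "3EML": (1.342, 0.828, 0.163),
    "3QAK": (1.098, 1.029, 0.185),
}

for model, z_scores in table_2_6.items():
    print(model, round(overall_z(*z_scores), 3))
```

With these weights the recomputed values round to 0.593 for 3EML and 0.647 for 3QAK, matching the Overall row of the table.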

Model evaluation

To evaluate the model quality, the structures were examined using PROCHECK51 in the SWISS-MODEL Workspace (Figure 2.16). Both models for 3eml and 3qak were found to be of sufficiently high quality, based on more than 90% of residues falling in the favored regions in the PROCHECK scoring. In the Ramachandran plot of the 3eml model, 95.1% (392/412) of all residues are located in the favored regions, suggesting that the model has good stereochemical quality. In the Ramachandran plot of the 3qak model, 93.7% (384/410) of all residues are located in the favored regions, again suggesting good stereochemical quality. Thus, the model structures meet the PROCHECK standards for a good quality model.
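
The percentages quoted above follow directly from the residue counts. A minimal sketch of the check against the 90% criterion is given below; the function names are illustrative assumptions, while the counts and the threshold are those given in the text.

```python
FAVORED_THRESHOLD_PERCENT = 90.0  # criterion quoted in the text

def favored_percent(favored, total):
    """Percentage of residues in the favored Ramachandran regions."""
    return 100.0 * favored / total

def passes(favored, total):
    """True if the model exceeds the 90% favored-region criterion."""
    return favored_percent(favored, total) > FAVORED_THRESHOLD_PERCENT

# (favored residues, total residues) as reported for each model
models = {"3eml": (392, 412), "3qak": (384, 410)}

for name, (favored, total) in models.items():
    print(name, round(favored_percent(favored, total), 1), passes(favored, total))
```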

2.7.5 Redocking of co-crystallized ligands into the homology models for AutoDock 4.0 docking validation

Using the same simulation cell parameters and AutoDock 4.0 settings as before (Section 2.7.2), a new docking analysis was performed on the homology model structures with the ECL2 loop built in for PDB 3eml and 3qak.

In Figure 2.17, the crystal structure of PDB 3eml (in blue) is structurally aligned with the homology model (in purple). The ZM241385 compound of the original crystal structure (in blue) is docked into the two structures: the original crystal structure and the homology model. The resulting pose in the crystal structure is shown in pink and the resulting pose of the docked ligand in the homology model is shown in purple. The ligand aligns correctly at the furan ring and the bicyclic triazolotriazine core with an RMSD of 0.796 Å, which is well under (better than) the acceptable 3 Å RMSD (Figure 2.18). The deviations of the atoms of the 4-hydroxyphenylethyl group vary between 2.5 and 3.0 Å, and this docking result shows the correct position of the oxygen (O4).

Figure 2.16 Ramachandran plot of homology models for 3eml and 3qak with missing loops built in. The residues in the favored position should be above 90% for a quality model, which is seen for the two models. Ramachandran diagrams were generated with SWISS-MODEL workspace.

The ligand pose in the 3eml homology model (in purple) is better than the initial docked pose in the crystal structure (in pink). The docking score in AutoDock was -8.99 kcal/mol for the homology model, compared to -8.06 kcal/mol for the initial docked crystal structure; a more negative score indicates better binding. The docking studies revealed that ECL2 provides important interactions with the ligands and affects ligand conformation.

In Figure 2.19, the crystal structure PDB 3qak (in tan) is structurally aligned with the homology model (in green). The UK-432097 compound of the original crystal structure (in tan) is docked into two structures: the crystal structure (in yellow) and the homology model (in green). The ribose ring and bicyclic adenine core, two features of known adenosine receptor agonists, have an RMSD under 3 Å (Figure 2.20). The RMSD of the piperidine rings is above 3 Å in one region, but the rings are in a much better position compared to the original docked structure. For 3qak, the homology model (in green) has the piperidine rings in the correct rotation, as observed in the original crystal structure. The docking score in AutoDock was -16.8 kcal/mol for the homology model; this more negative score indicates better binding compared to the -14.53 kcal/mol of the initial docked crystal structure without the ECL2 loop.

This docking study reveals that the loop region, ECL2, provides important interactions at the top region of the binding cavity. The poses of the ligands ZM241385 and UK-432097 docked without the built-in ECL2 region are invalid. The exact residue interactions are not shown in the previous figures, but the overall ligand pose is important for identifying the interactions occurring between the ligand and receptor. The RMSD value is important for verifying the set-up of the simulation cell and the amino acid residues to target. When docking other ligands with a similar core structure into the simulation cell, the predicted binding can be trusted as long as the molecule is not too large.

Figure 2.17 Docking validation of the 3eml homology model in YASARA AutoDock 4.0. The crystal structure with co-crystallized ligand ZM241385 (in blue) is structurally aligned with the docked ZM241385 in the crystal structure (in pink) and also the homology model with docked ligand ZM241385 (in purple). The docked ligand pose from the homology model (purple) is similar to the co-crystallized ligand (blue), which indicates that the ECL2 loop is important for interactions with the ligand. The RMSD between the ligand of the crystal structure and the homology model is 0.7959 Å, a much better docking pose and score than for the docked crystal structure. Image rendered in CHIMERA.41

Figure 2.18 Comparing the RMSD of atoms between the docked ligand in the crystal structure and the homology model 3eml. A) The RMSD of docked atoms of the ZM241385 from the crystal structure (Molecule 1 – ZMA-3eml) to the atoms of the docked ZM compound in the homology model (Molecule 2 – ZMA HM). The bicyclic core and furan ring match up correctly in the active site, which is necessary for proper binding interactions. The 4-hydroxyphenylethyl group has atoms in red that differ from the original docked homology model, which had much higher RMSD values than 3.6 Å. B) The labeled atom numbers of compound ZM241385.

Figure 2.19 Docking validation of the 3qak homology model in YASARA’s AutoDock 4.0. The 3qak crystal structure with co-crystallized ligand (in tan) is structurally aligned with the docked UK-432097 in the crystal structure (in yellow) and also the homology model (in green) with docked ligand UK-432097 (in light green). The ligand RMSD between the crystal structure (tan) and homology model (in light green) is 1.3003 Å and the piperidine rings are in the correct pose for the homology model. Image rendered in CHIMERA.41

Figure 2.20 Comparing the RMSD of atoms between the docked ligand in the crystal structure and homology model of 3qak. A) The RMSD of docked atoms of the UK-432097 from the crystal structure (Molecule 1 – UKA-3qak) to the atoms of the docked UK-432097 compound in the homology model (Molecule 2 – UKA_HM). The bicyclic core, ribose ring and phenyl rings match up correctly in the active site, which is necessary for proper binding interactions. Atoms in blue are above 2 Å and below 3 Å. The atoms in red are above 3 Å and are located in the region of the piperidine rings. The homology model yields a much better pose than the docked crystal structure. B) The labeled atom numbers of compound UK-432097.

The validated 3qak homology model can be used for identifying molecules of interest in future docking studies. For this project, antagonist binding is more important, both for the class of molecules of interest to the synthetic chemists and for future diagnostic and therapeutic applications. Therefore, the 3eml homology model structure is used for the subsequent docking studies of a new class of xanthine-based molecules.

2.7.6 Perform docking in GLIDE as a cross-docking validation for the homology model 3eml

For docking in Schrödinger’s GLIDE, the ligands were prepared using LigPrep in Maestro v9.3.515 of the Schrödinger Suite 2012.21 The ligands were pre-processed through LigPrep at a specified pH of 7.0. The model of the A2AAR (PDB 3eml) structure was prepared before docking using the Maestro 9.3 protein preparation wizard (Schrödinger, LLC, 2012, New York, NY): bond orders were assigned, and the orientation of hydroxyl groups, the amide groups of the side chains of Asn and Gln, and the charge state of histidine residues were optimized. A restrained minimization of the protein structure was performed using the default constraint of 0.3 Å RMSD and the OPLS 2001 force field.22 The enclosing box, centered on the centroid of the docked ligand around the selected residues of the homology model (Table 2.5), was set to a dimension of 25 Å. Both standard precision (SP) and extra precision (XP) docking of ligands were carried out; XP docking scores are reported. Water molecules were not included in these docking studies, as they are not present in the homology model. The scoring for GLIDE combines the energy grid score, the binding affinity predicted by GlideScore, and the internal strain energy for the model potential used to direct the conformational-search algorithm. The GlideScore values reported here are negative; the more negative the value, the better the predicted affinity.

The GLIDE docking values of ZM241385 (Figure 2.21) cannot be directly compared to the AutoDock binding energies because the two programs use different force fields and scoring functions. What can be compared are the docking poses. The bicyclic core of these known ligands is shared by the molecules investigated in this study.

Figure 2.21 Comparison of AutoDock and GLIDE docking results of ZM241385. A) The bicyclic core aligns to the crystal structure ligand (in black) for both the AutoDock (in gray) and GLIDE (in white) results. However, the 4-hydroxyphenylethyl moiety in the GLIDE result is shifted relative to the co-crystallized ZMA. The RMSD of the docked model is over 3 Å for the C3, O4 and C5 atoms. The RMSD of the bicyclic core region is below 1.0 Å, which is a very good value. B) The RMSD values for each atom of ZMA. C) The atom numbers labeled for ZM241385. Image rendered in CHIMERA.41

2.8 Applying YASARA AutoDock and GLIDE docking to xanthine-based molecules

In a collaboration between the Jones synthetic organic group and the Ondrechen group at NU, a class of arylxanthines was investigated. To optimize the target bioconjugate, molecular modeling studies were pursued in parallel with the synthetic efforts. Following the docking validation performed on known ligands, a docking study was performed on a class of xanthines. The xanthine core is the starting point for developing a better molecule than KW-6002 (previously discussed in Section 2.5). This section is a summary of the computationally guided ligand design for a set of compounds investigated using the molecular modeling tools AutoDock 4.0 and Schrödinger’s GLIDE. The results are two compounds, one of which, a promising lead compound, was synthesized and tested in in vitro functional assays.30

2.8.1 Understanding interactions of KW-6002 to the A2A adenosine receptor

There is no known crystal structure of KW-6002 (2.11) bound to the A2AAR. To identify similar binding poses between the molecules, KW-6002 was docked into the 3eml homology model (Figure 2.22A) and compared to the binding pose of the ZM241385 molecule by structural alignment using MUSTANG52 (Figure 2.22B). Structurally, KW-6002 contains a bicyclic core similar to that of ZM241385 and an aromatic moiety; these regions are in contact with Phe168 through aromatic stacking and make polar interactions with Asn253 and Glu169. The furan ring of ZM241385 differs from the arene of KW-6002, but both participate in hydrophobic interactions (Val84, Leu85 and Thr88) at the bottom of the cavity. The aromatic ring at the top of the cavity makes hydrophobic interactions with Ile267 and Met270. The docking values for the two molecules are listed in Figure 2.22C. The trends in the predictions of the two docking programs are similar, with ZM241385 having the better binding value.

Figure 2.22 The antagonist, KW-6002, docked into the 3eml homology model. A) The predicted pose of the molecule KW-6002 from docking programs AutoDock (in dark yellow) and GLIDE (in light yellow) docked into the homology model. B) The predicted pose of KW-6002 aligned with the co-crystallized ZM241385 molecule (in black) in the crystal structure 3eml. The xanthine core of KW-6002 aligns with the bicyclic triazolotriazine core of ZM241385, which establishes the pose location of xanthine-based antagonists. C) Reported docking values from AutoDock and GLIDE. Image rendered in CHIMERA.41

2.8.2 Applying YASARA AutoDock and GLIDE docking to methyl ester arylxanthines

A family of arylxanthine isomers derivatized as ortho-, meta-, and para-methyl esters (Figure 2.23) was scrutinized as surrogates for the PEG-ylated versions. The ortho-methyl ester derivative is not planar according to the GLIDE docking, but it is planar according to AutoDock (Figure 2.24). The meta-methyl ester had a significantly improved binding energy over the ortho-derivative, with the methyl group facing away from the binding cavity. The para-methyl ester arylxanthine shows a different pose in GLIDE, in which the bicyclic core is flipped around relative to the AutoDock pose.

Figure 2.23 Molecules of the A) ortho- 2.14, B) meta- 2.15, C) para- 2.16 methyl ester arylxanthines.

Figure 2.24 Docking results of the ortho-, meta-, and para- methyl ester arylxanthines. The active site of A) ortho-methyl ester, B) meta-methyl ester and C) para-methyl ester are shown with the lighter colored docked ligand being the GLIDE pose and the darker colored the AutoDock pose. The arylxanthines have similar binding poses in both AutoDock and GLIDE. The binding energy trends show the meta- and para-ester being better than the ortho. Image rendered in CHIMERA.41

The predicted poses of the arylxanthines fit in the binding cavity of the active site, but an added imaging agent may clash with residues located at the top of the protein. To avoid possible unwanted interactions, lengthening the molecules with a linker will allow the imaging agent to protrude outside of the binding cavity. It was decided to add a vinylene linker between the xanthine core and the arene. The synthetic process was not affected by the addition of the vinylene linker, and the resulting molecule mimics the structure of KW-6002.

2.8.3 Applying YASARA AutoDock and GLIDE docking to methyl ester styrylxanthines

The styrylxanthines (Figure 2.25), which have high structural similarity to KW-6002, were designed to provide appropriate spacing between the active site and the bioconjugate. Both the meta- and para-PEG derivatives demonstrated significantly improved binding energies over the arylxanthine compounds: 9.572 kcal/mol and 9.409 kcal/mol, respectively, in AutoDock, and Glide Scores of -9.000 and -8.500, respectively (Figure 2.26). The ortho-derivative had a decent binding energy, but the location of the methyl ester group’s interactions may lead to steric clashes once a PEG derivative is added. The critical π-stacking between the xanthine core and Phe168 and the hydrogen bonding between the xanthine C2-carbonyl and Asn253 contribute to the high affinity of these styrylxanthines. The incorporation of the vinyl linker extends the molecule so that the arene is stabilized by hydrophobic contact with Met270, a contact also observed in the binding of the phenyl pharmacophore in ZM241385.

Since neither analogue displayed a significant advantage over the other in binding at the xanthine core or the styryl pharmacophore, attention was directed to the conformation of the oligoethylene glycol moiety. Upon conjugate binding to the receptor, the polymeric carrier should not disrupt the binding of the antagonist. Therefore, the PEG chain should lie outside the active site, as it does with the para-PEG conjugate.

Figure 2.25 Molecules of the A) ortho- 2.17, B) meta- 2.18, C) para- 2.19 methyl ester styrylxanthines.

Figure 2.26 Docking results of ortho-, meta-, and para- methyl ester styrylxanthines. The active site of A) ortho-methyl ester, B) meta-methyl ester and C) para-methyl ester are shown with the lighter colored docked ligand from GLIDE and the darker colored from AutoDock. The styrylxanthines have similar binding poses in both AutoDock and GLIDE. The binding energy trends show the meta- and para-derivatives being the better binders, as was seen for the arylxanthines. Image rendered in CHIMERA.41

2.8.4 Results of best binders for PEG derivatives

Both the meta- and para-methyl esters of the styrylxanthine compounds exhibited better docking poses and binding energies than the ortho-methyl ester. The next step for modeling was to add the oligoethylene glycol (PEG) moiety. The PEG was easily conjugated to the meta- or para-position through site-specific condensation of the PEG hydroxyl (Figure 2.27). PEG is commercially available in a variety of lengths for use in the later synthetic addition.

Figure 2.27 The styrylxanthine PEG-analog selected for additional docking studies. The highlighted blue region of the compound is equivalent to KW-6002.

The binding energy for the meta-derivative was -8.194 kcal/mol and the para-derivative -9.20 kcal/mol in AutoDock. The Glide Scores were -7.723 and -7.960, respectively (Figure 2.28).
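
Because AutoDock and GLIDE use different scoring functions, only the ordering of ligands within each program is comparable across the two. The sketch below ranks the two PEG derivatives by each program using the four values quoted above; the dictionary layout and function name are illustrative assumptions.

```python
# Scores as quoted in the text (AutoDock in kcal/mol, GLIDE as Glide Score).
# Both follow the convention that a more negative value is more favorable.
scores = {
    "PEG-meta": {"autodock": -8.194, "glide": -7.723},
    "PEG-para": {"autodock": -9.20, "glide": -7.960},
}

def rank_by(program):
    """Ligand names ordered from most to least favorable for one program."""
    return sorted(scores, key=lambda name: scores[name][program])

autodock_rank = rank_by("autodock")
glide_rank = rank_by("glide")
print(autodock_rank, glide_rank, autodock_rank == glide_rank)
```

Both programs place the para-derivative first, although the margins are small and, as noted in the text, the pose of the PEG chain rather than the score drove the final selection.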

Figure 2.28 Models of PEG-para and PEG-meta styrylxanthine derivatives. The active site of A) PEG-para derivative and B) PEG-meta derivative are shown with the lighter colored docked ligand being the GLIDE pose and the darker colored the AutoDock pose. The GLIDE and AutoDock poses are similar but the GLIDE poses are shifted towards the top of the cavity by 0.5 Å. Image rendered in CHIMERA.41

Since neither analog displayed a significant advantage over the other in binding at the xanthine core or the styryl pharmacophore, attention was directed to the conformation of the oligoethylene glycol moiety. Upon conjugate binding to the receptor, the bioconjugate should not disrupt the binding of the antagonist. Therefore, the PEG chain should lie outside the active site, as it does with the PEG-para conjugate. The docking of the PEG-meta derivative shows the diethylene glycol folded back into the pocket, and although this did not affect the conjugate’s theoretical binding energy, it is indicative of an interference that may occur with a longer ethylene glycol chain (n  4). Unfortunately, docking of a PEG chain with n  4 was not possible with AutoDock or GLIDE and would most likely have been unreliable because of the large number of internal degrees of freedom. Based on these observations, the PEG-para analogue of KW-6002 was selected for development to include an 8-mer PEG-chain.

2.9 Proposed synthesis of xanthine-based molecules

The synthetic efforts to develop A2AAR antagonist bioconjugates for use in the diagnosis and treatment of hypoxic tumors were performed in the Jones lab by Rhiannon Thomas-Tran. The scale up of para-PEG was performed by Vincent Chevalier.30

The Jones lab developed an efficient method for the synthesis of 8-substituted xanthines from 5,6-diaminouracils (2.26) and carboxaldehydes using bromodimethylsulfonium bromide.53,54 The first compound synthesized was an ortho-substituted aryl ester, which was converted to the PEGylated derivative. The synthesis is explained in detail in the thesis of Rhiannon Thomas-Tran.55 The formation of diaminouracil (2.26) occurs over three steps from the starting reagents diethylurea and cyanoacetic acid (Figure 2.29).

Figure 2.29 Initial synthesis of diaminouracil.53,54

The synthesis of the styrylxanthine began with Heck coupling of acrolein to iodobenzoate under phosphine-free conditions (Figure 2.30). The reaction gives the cinnamaldehyde product (2.27) without aryl-aryl homocoupling or iodide elimination. Compound (2.27) is coupled to diaminouracil (2.26) in the presence of BDMS. The product (2.29) is isolated in high purity from DMSO/water. Xanthine methylation and ester saponification yielded the functionalized styrylxanthine (2.30) for polymer conjugation. PEG methyl ethers (PEG-Me) were used to conjugate the styrylxanthine compound; specifically, an octaethylene glycol-Me chain was chosen because of its higher isolated yield compared to another PEG-Me. Coupling with EDCI (additional 1.1 equiv. EDCI and 10 mol % DMAP with overnight reflux) produced 40% isolated yields. Synthetically, the route to 2.31 did not require chromatography at any stage prior to PEG coupling, allowing efficient scale-up.

Figure 2.30 Synthesis of styrylxanthine-PEG conjugate. The PEG analog 2.30, n = 8, was synthesized and scaled-up for in vivo and in vitro assays.

The compound was water soluble and stable in solution. No isomerization of the PEGylated variants was detected by HPLC analysis after 2 weeks in solution. The precise mechanism of this protective function is unknown; the steric constraints in the PEG may have an impact, favoring the E isomer over the less potent Z isomer. The improvement in physicochemical properties promised a better-performing compound than KW-6002. As the scale-up of the PEGylated-para derivative was underway, preliminary in vitro assays were performed on the lead compound.30

2.10 Test synthesized compounds with in vitro functional assays

In vitro functional assays were performed on the lead compound in the Sitkovsky lab by Dr. Kaisa Selesniemib and Dr. Stephen Hatfield, to confirm efficacy.30

To determine the A2A binding-dependent signaling, bioassays of the PEG-para analogue were performed with two functional models. The first assay, a cAMP assay, tested the PEG analog 2.31 by measuring the extent to which it inhibits A2AAR-induced intracellular cAMP accumulation in A2A-expressing lymphocytes. To activate the A2AAR in these assays, the agonist CGS 21680 was used. The antagonists, KW-6002 and our PEG-para analog, should prevent CGS 21680-mediated signaling. The antagonistic potency of our PEG-para molecule is predicted by docking to be stronger than that of KW-6002, and it exhibits better properties. The second assay tests cytokine secretion, a sensitive test to determine the functional effects of A2AAR signaling. After TCR activation by CD3 ligand, the T cells from splenocytes are incubated with CGS 21680, which prevents IFN-gamma (IFNγ) secretion and increases the A2A receptor-induced immunosuppressive intracellular cyclic adenosine monophosphate (cAMP) concentration. Activated T cells produce IFNγ, and this production is significantly inhibited by the immunosuppressive cAMP-elevating A2AAR signaling. If KW-6002 and PEG-KW are good antagonists, increasing quantities of them in the assay should restore IFN-gamma secretion.

2.10.1 Measuring functionality of A2AAR antagonism by cAMP assay

Stimulation of intracellular cAMP production and measurement of cAMP levels were performed as described previously.2, 18b Lymphocytes were isolated from the spleen of C57/BL6 mice and treated with 1 μM CGS 21680, an A2A-specific agonist (provided by Tocris, Ellisville, MO). KW-6002 and PEG-para were used at concentrations between 0.1 and 10 μM. The cells were incubated for 15 min at 37 °C, and the reaction was stopped by addition of 1 M hydrochloric acid. Levels of cAMP were determined by enzyme-linked immunosorbent assay (ELISA). All treatment groups were done in triplicate. The CGS 21680 agonist stimulates the A2AAR, and both KW and PEG-para inhibit this signaling (Figure 2.31). PEG-para performs better than KW in these studies.

Figure 2.31 cAMP functional assay to measure the functional effects of A2AAR signaling. Cyclic-AMP levels in lymphocytes after incubation with vehicle (VEH), 1 μM CGS (CGS 21680), 1 μM CGS plus 0.5-10 μM KW (KW), and 1 μM CGS plus KW-PEG (PEG-para) are shown. The intracellular cAMP levels were determined 15 minutes following stimulation using quantitative cAMP ELISA and are expressed as pmol/million cells. Data shown represent mean ± SEM of triplicate samples.
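
For reference, the mean ± SEM reported for triplicate samples can be computed as sketched below, with the SEM taken as the sample standard deviation divided by the square root of the number of replicates. The function name is an assumption, and the three input values are placeholders chosen for easy arithmetic; they are not measurements from this study.

```python
import math
import statistics

def mean_and_sem(replicates):
    """Return (mean, standard error of the mean) for replicate readings.
    Uses the sample standard deviation (n - 1 in the denominator)."""
    n = len(replicates)
    if n < 2:
        raise ValueError("at least two replicates are needed for an SEM")
    mean = statistics.mean(replicates)
    sem = statistics.stdev(replicates) / math.sqrt(n)
    return mean, sem

# Placeholder triplicate (hypothetical numbers, not data from the assays).
placeholder_triplicate = [1.0, 2.0, 3.0]
mean, sem = mean_and_sem(placeholder_triplicate)
print(f"{mean:.3f} ± {sem:.3f}")
```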

2.10.2 The cytokine release assay of PEG-para reveals stronger A2AAR antagonism than KW-6002

Lymphocytes were isolated from the spleen of C57/BL6 mice and cultured with 0.1 μg/ml CD3 mAb to induce production of IFN-gamma. Immediately following the addition of mAb-CD3, the cells were treated with or without 1, 10, or 100 nM CGS 21680 agonist. To examine the functionality of the antagonists, KW or PEG-para (0.5 μM) was added to the cells. After 24 hours, supernatants were collected and the concentration of IFN-gamma was measured by ELISA using paired mAb and standard purchased from BD Pharmingen. All treatment groups were done in triplicate.

The second assay, a cytokine secretion assay, is a more sensitive measure of the functional effects of A2A receptor signaling. In this type of assay, following TCR activation by CD3 ligand, the T cells from splenocytes are incubated with CGS 21680, which prevents IFN-gamma secretion by increasing A2A receptor-induced immunosuppressive intracellular cAMP. The activated T cells produce IFN-gamma (Figure 2.32), and this production is inhibited by engagement of immunosuppressive cAMP-elevating A2A receptor signaling. In the assay, KW-6002 and PEG-para restore the IFN-gamma secretion when the concentration is increased.

Figure 2.32 Cytokine secretion assay to measure the functional effects of A2A receptor signaling. The IFN-gamma production by lymphocytes after activation with 0.1 μg/ml mAb-CD3 and when treated with vehicle (VEH), 1-100 nM CGS (CGS 21680), 1-100 nM CGS plus 0.5 μM KW (KW-6002), and 1-100 nM CGS plus 0.5 μM KW-para (PEG-para) is shown. The IFN-γ levels were determined in the supernatant one day following stimulation using quantitative ELISA and are expressed as pg/ml. Data shown represent mean ± SEM of triplicate samples.

KW-6002 and PEG-para have similar activity in the in vitro assays. This result also helps to validate the molecular modeling strategy used in the design. The PEG-para molecule does not undergo photoisomerization and is more stable in solution, an improvement over the physicochemical properties of KW-6002.

The computationally guided ligand design process was effective in assisting the selection of the top priority for the lead compound to be synthesized and tested by functional bioassays. Xanthine analogues with desirable physical and chemical properties were identified in the molecular docking screening. With the functional in vitro assays complete, in vivo tumor regression studies using hypoxic models will be designed and performed in future studies.

2.11 Summary

Molecular modeling techniques and computational tools provided a directed approach to improving xanthine-based A2AAR antagonists (i.e., caffeine and KW-6002) and led to the identification of a promising lead compound, PEG-para (Figure 2.33).

Figure 2.33 The lead compound, PEG-para 8-mer.

An efficient synthesis of the styrylxanthine analog and its PEGylated derivatives yielded a lead compound that is unaffected by photoisomerization. The lead compound was tested in two functional bioassays. The first tested the inhibition of A2AAR-induced intracellular cAMP accumulation in A2AAR-expressing lymphocytes. The second was a cytokine secretion assay to determine the functional effects of A2AAR signaling. The results for the lead compound, compared with those for the antagonist KW-6002, indicate that it is a suitable candidate cancer immunotherapy agent for hypoxic tumors. The next steps for this project are to scale up the production of the lead compound and to label it with either an 18F or 123I tag for biodistribution studies in rodent models. The final step will be to design and conduct in vivo tumor regression studies using hypoxic models.

References

1. O'Hayre, M.; Vazquez-Prado, J.; Kufareva, I.; Stawiski, E. W.; Handel, T. M.; Seshagiri, S.; Gutkind, J. S., The emerging mutational landscape of G proteins and G-protein-coupled receptors in cancer. Nat Rev Cancer 2013, 13 (6), 412-24.

2. Ohta, A.; Sitkovsky, M., Role of G-protein-coupled adenosine receptors in downregulation of inflammation and protection from tissue damage. Nature 2001, 414 (6866), 916-20.

3. Venkatakrishnan, A. J.; Deupi, X.; Lebon, G.; Tate, C. G.; Schertler, G. F.; Babu, M. M., Molecular signatures of G-protein-coupled receptors. Nature 2013, 494 (7436), 185-94.

4. Rask-Andersen, M.; Almen, M. S.; Schioth, H. B., Trends in the exploitation of novel drug targets. Nat Rev Drug Discov 2011, 10 (8), 579-90.

5. Ritter, S. L.; Hall, R. A., Fine-tuning of GPCR activity by receptor-interacting proteins. Nat Rev Mol Cell Biol 2009, 10 (12), 819-30.

6. Li, J.; Ning, Y.; Hedley, W.; Saunders, B.; Chen, Y.; Tindill, N.; Hannay, T.; Subramaniam, S., The Molecule Pages database. Nature 2002, 420 (6916), 716-7.

7. Jacobson, K. A.; Gao, Z. G., Adenosine receptors as therapeutic targets. Nat Rev Drug Discov 2006, 5 (3), 247-64.

8. Baraldi, P. G.; Tabrizi, M. A.; Gessi, S.; Borea, P. A., Adenosine receptor antagonists: translating medicinal chemistry and pharmacology into clinical utility. Chem Rev 2008, 108 (1), 238-63.

9. Fredholm, B. B.; IJzerman, A. P.; Jacobson, K. A.; Klotz, K. N.; Linden, J., International Union of Pharmacology. XXV. Nomenclature and classification of adenosine receptors. Pharmacol Rev 2001, 53 (4), 527-52.

10. (a) Matherne, G. P.; Headrick, J. P.; Coleman, S. D.; Berne, R. M., Interstitial transudate purines in normoxic and hypoxic immature and mature rabbit hearts. Pediatr Res 1990, 28 (4), 348-53; (b) Hirschhorn, R.; Roegner-Maniscalco, V.; Kuritsky, L.; Rosen, F. S., Bone marrow transplantation only partially restores purine metabolites to normal in adenosine deaminase-deficient patients. J Clin Invest 1981, 68 (6), 1387-93; (c) Cronstein, B. N.; Naime, D.; Ostad, E., The antiinflammatory mechanism of methotrexate. Increased adenosine release at inflamed sites diminishes leukocyte accumulation in an in vivo model of inflammation. J Clin Invest 1993, 92 (6), 2675-82.

11. Piirainen, H.; Ashok, Y.; Nanekar, R. T.; Jaakola, V. P., Structural features of adenosine receptors: from crystal to function. Biochim Biophys Acta 2011, 1808 (5), 1233-44.

12. Jaakola, V. P.; Griffith, M. T.; Hanson, M. A.; Cherezov, V.; Chien, E. Y.; Lane, J. R.; Ijzerman, A. P.; Stevens, R. C., The 2.6 angstrom crystal structure of a human A2A adenosine receptor bound to an antagonist. Science 2008, 322 (5905), 1211-7.

13. Sherbiny, F. F.; Schiedel, A. C.; Maass, A.; Muller, C. E., Homology modelling of the human adenosine A2B receptor based on X-ray structures of bovine rhodopsin, the beta2-adrenergic receptor and the human adenosine A2A receptor. J Comput Aided Mol Des 2009, 23 (11), 807-28.

14. Gessi, S.; Merighi, S.; Sacchetto, V.; Simioni, C.; Borea, P. A., Adenosine receptors and cancer. Biochim Biophys Acta 2011, 1808 (5), 1400-12.

15. Chen, J. F.; Eltzschig, H. K.; Fredholm, B. B., Adenosine receptors as drug targets--what are the challenges? Nat Rev Drug Discov 2013, 12 (4), 265-86.

16. Sitkovsky, M. V.; Lukashev, D.; Apasov, S.; Kojima, H.; Koshiba, M.; Caldwell, C.; Ohta, A.; Thiel, M., Physiological control of immune response and inflammatory tissue damage by hypoxia-inducible factors and adenosine A2A receptors. Annu Rev Immunol 2004, 22, 657-82.

17. Sitkovsky, M. V.; Kjaergaard, J.; Lukashev, D.; Ohta, A., Hypoxia-adenosinergic immunosuppression: tumor protection by T regulatory cells and cancerous tissue hypoxia. Clin Cancer Res 2008, 14 (19), 5947-52.

18. (a) Koshiba, M.; Kojima, H.; Huang, S.; Apasov, S.; Sitkovsky, M. V., Memory of extracellular adenosine A2A purinergic receptor-mediated signaling in murine T cells. J Biol Chem 1997, 272 (41), 25881-9; (b) Apasov, S. G.; Chen, J. F.; Smith, P. T.; Schwarzschild, M. A.; Fink, J. S.; Sitkovsky, M. V., Study of A(2A) adenosine receptor gene deficient mice reveals that adenosine analogue CGS 21680 possesses no A(2A) receptor-unrelated lymphotoxicity. Br J Pharmacol 2000, 131 (1), 43-50.

19. Gantner, F.; Leist, M.; Lohse, A. W.; Germann, P. G.; Tiegs, G., Concanavalin A-induced T-cell-mediated hepatic injury in mice: the role of tumor necrosis factor. Hepatology 1995, 21 (1), 190-8.

20. Ohta, A.; Gorelik, E.; Prasad, S. J.; Ronchese, F.; Lukashev, D.; Wong, M. K.; Huang, X.; Caldwell, S.; Liu, K.; Smith, P.; Chen, J. F.; Jackson, E. K.; Apasov, S.; Abrams, S.; Sitkovsky, M., A2A adenosine receptor protects tumors from antitumor T cells. Proc Natl Acad Sci U S A 2006, 103 (35), 13132-7.

21. Sitkovsky, M. V., T regulatory cells: hypoxia-adenosinergic suppression and re-direction of the immune response. Trends Immunol 2009, 30 (3), 102-8.

22. Vaupel, P.; Mayer, A., Hypoxia in cancer: significance and impact on clinical outcome. Cancer Metastasis Rev 2007, 26 (2), 225-39.

23. Vaupel, P.; Harrison, L., Tumor hypoxia: causative factors, compensatory mechanisms, and cellular response. Oncologist 2004, 9 Suppl 5, 4-9.

24. Carmeliet, P.; Jain, R. K., Principles and mechanisms of vessel normalization for cancer and other angiogenic diseases. Nat Rev Drug Discov 2011, 10 (6), 417-27.

25. Harris, A. L., Hypoxia--a key regulatory factor in tumour growth. Nat Rev Cancer 2002, 2 (1), 38-47.

26. Muller, C. E.; Jacobson, K. A., Recent developments in adenosine receptor ligands and their potential as novel drugs. Biochim Biophys Acta 2011, 1808 (5), 1290-308.

27. Ghimire, G.; Hage, F. G.; Heo, J.; Iskandrian, A. E., Regadenoson: a focused update. J Nucl Cardiol 2013, 20 (2), 284-8.

28. de Lera Ruiz, M.; Lim, Y. H.; Zheng, J., Adenosine A2A Receptor as a Drug Discovery Target. J Med Chem 2013.

29. Muller, C. E.; Jacobson, K. A., Xanthines as adenosine receptor antagonists. Handb Exp Pharmacol 2011, (200), 151-99.

30. Thomas, R.; Lee, J.; Chevalier, V.; Sadler, S.; Selesniemi, K.; Hatfield, S.; Sitkovsky, M.; Ondrechen, M. J.; Jones, G. B., Design and evaluation of xanthine based adenosine receptor antagonists: potential hypoxia targeted immunotherapies. Bioorg Med Chem 2013, 21 (23), 7453-64.

31. Adenosine Receptors in Health and Disease. Springer: New York, 2009; Vol. 193.

32. Yang, M.; Soohoo, D.; Soelaiman, S.; Kalla, R.; Zablocki, J.; Chu, N.; Leung, K.; Yao, L.; Diamond, I.; Belardinelli, L.; Shryock, J. C., Characterization of the potency, selectivity, and pharmacokinetic profile for six adenosine A2A receptor antagonists. Naunyn Schmiedebergs Arch Pharmacol 2007, 375 (2), 133-44.

33. Muller, C. E.; Sandoval-Ramirez, J.; Schobert, U.; Geis, U.; Frobenius, W.; Klotz, K. N., 8-(Sulfostyryl)xanthines: water-soluble A2A-selective adenosine receptor antagonists. Bioorg Med Chem 1998, 6 (6), 707-19.

34. Hockemeyer, J.; Burbiel, J. C.; Muller, C. E., Multigram-scale syntheses, stability, and photoreactions of A2A adenosine receptor antagonists with 8-styrylxanthine structure: potential drugs for Parkinson's disease. J Org Chem 2004, 69 (10), 3308-18.

35. Mantell, S. J.; Stephenson, P. T.; Monaghan, S. M.; Maw, G. N.; Trevethick, M. A.; Yeadon, M.; Walker, D. K.; Selby, M. D.; Batchelor, D. V.; Rozze, S.; Chavaroche, H.; Lemaitre, A.; Wright, K. N.; Whitlock, L.; Stuart, E. F.; Wright, P. A.; Macintyre, F., SAR of a series of inhaled A(2A) agonists and comparison of inhaled pharmacokinetics in a preclinical model with clinical pharmacokinetic data. Bioorg Med Chem Lett 2009, 19 (15), 4471-5.

36. Rao, N.; Dvorchik, B.; Sussman, N.; Wang, H.; Yamamoto, K.; Mori, A.; Uchimura, T.; Chaikin, P., A study of the pharmacokinetic interaction of istradefylline, a novel therapeutic for Parkinson's disease, and atorvastatin. J Clin Pharmacol 2008, 48 (9), 1092-8.

37. Iyer, A. K.; Khaled, G.; Fang, J.; Maeda, H., Exploiting the enhanced permeability and retention effect for tumor targeting. Drug Discov Today 2006, 11 (17-18), 812-8.

38. Haag, R.; Kratz, F., Polymer therapeutics: concepts and applications. Angew Chem Int Ed Engl 2006, 45 (8), 1198-215.

39. Harris, J. M.; Chess, R. B., Effect of pegylation on pharmaceuticals. Nat Rev Drug Discov 2003, 2 (3), 214-21.

40. Ivanov, A. A.; Barak, D.; Jacobson, K. A., Evaluation of homology modeling of G-protein-coupled receptors in light of the A(2A) adenosine receptor crystallographic structure. J Med Chem 2009, 52 (10), 3284-92.

41. Pettersen, E. F.; Goddard, T. D.; Huang, C. C.; Couch, G. S.; Greenblatt, D. M.; Meng, E. C.; Ferrin, T. E., UCSF Chimera--a visualization system for exploratory research and analysis. J Comput Chem 2004, 25 (13), 1605-12.

42. Lebon, G.; Warne, T.; Edwards, P. C.; Bennett, K.; Langmead, C. J.; Leslie, A. G.; Tate, C. G., Agonist-bound adenosine A2A receptor structures reveal common features of GPCR activation. Nature 2011, 474 (7352), 521-5.

43. Xu, F.; Wu, H.; Katritch, V.; Han, G. W.; Jacobson, K. A.; Gao, Z. G.; Cherezov, V.; Stevens, R. C., Structure of an agonist-bound human A2A adenosine receptor. Science 2011, 332 (6027), 322-7.

44. Christofi, F. L.; Cook, M. A., Antagonism by theophylline of the adenosine receptor agonist 5'-N-ethylcarboxamidoadenosine at the guinea pig ileum. Can J Physiol Pharmacol 1985, 63 (9), 1195-7.

45. Katritch, V.; Cherezov, V.; Stevens, R. C., Structure-function of the G protein-coupled receptor superfamily. Annu Rev Pharmacol Toxicol 2013, 53, 531-56.

46. Tong, W.; Wei, Y.; Murga, L. F.; Ondrechen, M. J.; Williams, R. J., Partial order optimum likelihood (POOL): maximum likelihood prediction of protein active site residues using 3D structure and sequence properties. PLoS Comput Biol 2009, 5 (1), e1000266.

47. Morris, G. M.; Huey, R.; Lindstrom, W.; Sanner, M. F.; Belew, R. K.; Goodsell, D. S.; Olson, A. J., AutoDock4 and AutoDockTools4: Automated Docking with Selective Receptor Flexibility. J Comput Chem 2009, 30 (16), 2785-91.

48. Friesner, R. A.; Banks, J. L.; Murphy, R. B.; Halgren, T. A.; Klicic, J. J.; Mainz, D. T.; Repasky, M. P.; Knoll, E. H.; Shelley, M.; Perry, J. K.; Shaw, D. E.; Francis, P.; Shenkin, P. S., Glide: a new approach for rapid, accurate docking and scoring. 1. Method and assessment of docking accuracy. J Med Chem 2004, 47 (7), 1739-49.

49. Konagurthu, A. S.; Reboul, C. F.; Schmidberger, J. W.; Irving, J. A.; Lesk, A. M.; Stuckey, P. J.; Whisstock, J. C.; Buckle, A. M., MUSTANG-MR structural sieving server: applications in protein structural analysis and crystallography. PLoS One 2010, 5 (4), e10048.

50. (a) Hernández-Santoyo, A.; Tenorio-Barajas, A. Y.; Altuzar, V.; Vivanco-Cid, H.; Mendoza-Barrera, C., Protein-Protein and Protein-Ligand Docking. 2013; (b) Elokely, K. M.; Doerksen, R. J., Docking Challenge: Protein Sampling and Molecular Docking Performance. J Chem Inf Model 2013; (c) Onodera, K.; Satou, K.; Hirota, H., Evaluations of molecular docking programs for virtual screening. J Chem Inf Model 2007, 47 (4), 1609-18.

51. Laskowski, R. A.; MacArthur, M. W.; Moss, D. S.; Thornton, J. M., PROCHECK: a program to check the stereochemical quality of protein structures. J Appl Crystallogr 1993, 26 (2), 283-91.

52. Konagurthu, A. S.; Whisstock, J. C.; Stuckey, P. J.; Lesk, A. M., MUSTANG: a multiple structural alignment algorithm. Proteins 2006, 64 (3), 559-74.

53. Dong, M.; Sitkovsky, M.; Kallmerten, A. E.; Jones, G. B., Synthesis of 8-substituted xanthines via 5,6-diaminouracils: an efficient route to A(2A) adenosine receptor antagonists. Tetrahedron Lett 2008, 49 (31), 4633-5.

54. Labeaume, P.; Dong, M.; Sitkovsky, M.; Jones, E. V.; Thomas, R.; Sadler, S.; Kallmerten, A. E.; Jones, G. B., An efficient route to xanthine based A(2A) adenosine receptor antagonists and functional derivatives. Org Biomol Chem 2010, 8 (18), 4155-7.

55. Thomas, R. Design and synthesis of adenosine receptor antagonist bioconjugates as anti-tumor immunotherapies. Northeastern University, Boston, MA, 2011.

Chapter 3

Introduction of protein function annotation method: Structurally Aligned Local Sites of Activity (SALSA)

3.1 Introduction of the Protein Structure Initiative (PSI) and structural genomics (SG)

Determining the function of a protein from its three-dimensional (3D) structure is an important problem in the post-genomic era. In the Protein Data Bank1 (PDB) there are 97,351 deposited structures (January 2014); 12,900 of these are structural genomics (SG) proteins.2 In 2000, the Protein Structure Initiative (PSI) was launched in the United States to make the 3D structures of proteins easily attainable from knowledge of their corresponding DNA sequences. The PSI project generated a large set of 3D structures, termed structural genomics proteins. The largest centers include the Midwest Center for Structural Genomics (MCSG), the Joint Center for Structural Genomics (JCSG), the Structural Genomics Consortium (SGC), the Northeast Structural Genomics Consortium (NSGC) and the Center for Structural Genomics of Infectious Disease (CSGID), each of which maintains its own website and information. Not all of this information is deposited in the PDB; for example, the MCSG maintains its own database with a “Justification for Selection.” The aim of the PSI later shifted toward generating research in four areas: 1) determining protein function, 2) identifying better therapeutics for genetic and infectious diseases, 3) developing new experimental designs and 4) developing better methodologies for protein production and crystallography.3 By July 2010, the cost of the PSI endeavor had reached more than three-quarters of a billion dollars, and it is expected to pass a billion dollars easily by the end of the last phase of the project in 2015.4 Table 3.1 shows the phases, goals and outcomes of the PSI in detail. The National Institute of General Medical Sciences (NIGMS) does not plan to continue the PSI after the PSI:Biology phase is completed in 2015.5 Overall, the information generated from these SG structures can aid in drug target design, in understanding unique genome sequences and in better understanding evolutionary relationships. A large number of solved SG structures have been identified as hypothetical proteins (1,513) or assigned unknown function (2,977), and many other SG proteins were annotated (assigned a function) using sequence or structural similarity matching tools. Similar sequence does not always imply similar function, and problems arise especially in cases of low sequence similarity, resulting in misannotations. For sequence homology-based assignments, a high sequence identity (>40-60%) is required for accurate function transfer.6 Sequence and structural similarity can be unreliable and cannot be applied to proteins with novel folds or low homology with other proteins.7 The exact number of misannotations is unknown. There is a need to identify the incorrectly annotated or uncharacterized structural genomics proteins.

Table 3.1 Overview of the three phases of the Protein Structure Initiative. This includes the budget, goals, number of centers and cumulative total of structures solved.5

3.2 Current computational methods to determine function of uncharacterized proteins

Experimental determination of function currently relies on high-throughput screening with functional assays, which is heavily time and resource dependent.8 Alternatively, a computational approach to assigning the biological or biochemical roles of enzymes is a faster and less costly option, given the available bioinformatics tools, database information and computational methods. Computational methods are of two types, sequence-based and structure-based, depending on the availability of protein sequence homologues or 3D structural information.

Figure 3.1 Methods used to identify the function of a protein using sequence or structural information. Examples of programs that fall into each method category are listed in red. The background is further described in sections 3.2.1 and 3.2.2.

3.2.1 Sequence-based methods

Sequence similarities between two proteins can imply structural, functional and evolutionary relationships between species or between two similar proteins of the same species. The most popular tools for sequence similarity searching are the Position-Specific Iterative Basic Local Alignment Search Tool (PSI-BLAST)9 and FASTA10. PSI-BLAST compares a query protein sequence to other sequences in a database to identify short matches for local alignment, using a statistical score of the alignment. FASTA finds regions of local or global similarity by searching through protein databases.11 Evolutionary Trace (ET)12 exploits sequence conservation to rank the amino acids in a sequence by their relative evolutionary importance. Another sequence-based approach is motif searching with a sequence tool such as MOTIF.13 MOTIF searches sequence databases such as PROSITE14, Pfam15 and InterPro16, which hold information about protein families and domains, and identifies significant sites, patterns and profiles of known protein structures. These sequence methods rely on high coverage and on high sequence similarity (>30%) for the function transfer to be valid, but this condition is not met for distantly related proteins, for less characterized organisms or families, or for most structural genomics proteins.17 The Information-theoretic Tree Traversal for Protein Functional Site Identification (INTREPID) server uses a protein sequence to identify homologs, which are aligned and used to produce a phylogenetic tree. The INTREPID method assigns each position a score based on the degree of conservation, incorporating the phylogenetic tree information.18 Although sequence-based methods give insight, structural information can provide valuable aid in functional identification through analysis of the fold or the active site of a protein.
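To make the sequence identity thresholds concrete, the short Python sketch below computes the percent identity of two pre-aligned sequences and compares it with a user-supplied threshold. It is an illustration only: the function names and the example sequences are hypothetical, the alignment is assumed to have been produced elsewhere, and identity is computed here over positions where neither sequence has a gap (other conventions for the denominator exist).

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of identical residues over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != gap and b != gap]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


def meets_identity_threshold(identity, threshold):
    """True if the identity exceeds the chosen threshold (e.g. 30 or 40 percent)."""
    return identity > threshold


# Hypothetical aligned fragments (illustrative only).
query = "MKT-AYIAK"
homolog = "MRTLAY-AK"
identity = percent_identity(query, homolog)
print(f"{identity:.1f}% identity")
print(meets_identity_threshold(identity, 30.0))
```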

3.2.2 Structure-based methods

Structural input provides information about remote evolutionary relationships through folds, which reveal more about the biochemical function than does the sequence alone. Homologous fold relationships can be described as either 1) proteins that have a common ancestor and are members of the same structural superfamily or 2) proteins that converged to the same fold and do not share a common evolutionary history.6a Protein structure classification databases like Structural Classification of Proteins (SCOP)19 and CATH20 use structural alignments to catalogue relationships between protein families by investigating domains and homology. Protein families that have global structural similarity may not have similar function. This is seen with the TIM-barrel fold, the ferredoxin fold and the Rossmann fold, for instance.21 Structure-based function prediction methods compare the query protein with known proteins using global or local structure similarity.22 Global methods investigate the overall shape of the protein, while local methods investigate residue composition and probable mechanisms in the active site. Commonly used global structure-based methods, such as combinatorial extension (CE)23, Dali24, TopoFit25 and Vector Alignment Search Tool (VAST)26, investigate and compare the overall shape. The global structure, in terms of pocket information and geometric shape, has been shown to identify the function of closely related proteins; this is the idea used by the programs ProFunc27 and F-seeker28. Local structure-based methods compare local regions of a query protein to a database of known functional sites, such as the active sites of enzymes.22 Commonly used local structure-based methods, like FLORA29 and SURFACE30, identify regions of a query protein and search for them in such databases. CASTp31 and Pocket Surfer32 use geometric information to identify active sites, which can be compared with a database of known active sites, such as the Catalytic Site Atlas (CSA)33. Solvent mapping is another method, which uses small molecule probes to identify functional regions on protein surfaces; one such program is FTMap34. Each method provides a prediction of functionality, but a combination of multiple methods can give more information about the function.

Sequence- and structure-based methods have demonstrated that predicting the active site residues is the key to unlocking the function of a structure. A combination of sequence and structure prediction methods is one option to identify all active site residues with high precision and accuracy. The method described in this chapter is the development and optimization of the functional annotation method, Structurally Aligned Local Sites of Activity (SALSA).35 SALSA identifies the local active site using previously characterized proteins of known function within a superfamily to then assign function to proteins with similar active sites (Figure 3.2). This applies to structural genomics proteins when the structure is known but functional information is unavailable or limited. Current and former Ondrechen group members: Dr. Pengcheng Yin, Dr. Ramya Parasuram, Dr. Srinivas Somarowthu, and Dr. Zhouxi Wang, along with me, have contributed to the development of this method and have tested SALSA on various superfamilies.

3.3 Background of Structurally Aligned Local Sites of Activity (SALSA)

A superfamily is a group of proteins whose members, as defined by Babbitt and Gerlt36, catalyze the same chemical reaction but do not share the same substrate specificities and have less sequence similarity than members of the same family. “Mechanistically diverse superfamilies diverged even earlier and have greater diversity in that many of their constituent families show no significant sequence similarity and do not share the same substrate specificities, so families within them catalyze different overall reactions and share only some mechanistic attributes.”37 Investigating similarities within a superfamily reveals small differences that explain how the enzymes catalyze the same chemical reaction.

Figure 3.2 Overview of the SALSA methodology. A superfamily contains subgroups of structurally related proteins. The available 3D structures are submitted to POOL (functional site predictor), which predicts active site residues. The common features of these sets of active site residues become a “consensus signature” for each subclass. The subclasses are compared with one another by multiple structure alignment (MSA). A structural genomics (SG) protein is submitted to POOL and the resulting set of predicted active site residues is compared to the consensus signature of each subgroup by multiple structure alignment. The alignment is assigned a score that measures whether the function is a good or poor match with the Normalized Match Consensus Signatures.

Characterized proteins in superfamilies can be used as a known dataset to confirm the predictions from the functional site predictor, Partial Order Optimum Likelihood (POOL), and from the SALSA method (Figure 3.2). Some superfamilies are well studied and have important experimental and structural information available. Using POOL, the active site of each member of a superfamily can be predicted computationally and verified against the literature. Once the data set is complete, SALSA analysis can be performed on structural genomics proteins that are suggested to have the function of the superfamily or that have structural similarities to it. The advantage of the SALSA method is that it identifies the local active site; identification of the local structure containing the active site residues is more reliable for assigning biochemical function than overall fold comparison. Focusing on the local site gives clear implications about function, which can then be validated by experiments. The Ondrechen Research Group (ORG) developed the SALSA method to classify a known set of proteins of a superfamily and to properly assign function to structural genomics proteins found within the superfamily. Various molecular modeling and bioinformatics tools were used to extract the important information. The functions predicted for the SG proteins can then be validated experimentally using functional assays.

3.4 Procedure to run the SALSA analysis

This section discusses the parameters and steps (Figure 3.2) used in the SALSA methodology. A superfamily is selected and all known 3D structures are identified in the PDB. The protein structures are submitted to POOL and the output information is used to generate a SALSA table. Using a scoring matrix, the table is used to assign function to structural genomics proteins of unknown or uncertain function.
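The idea of a consensus signature and of matching a query against it can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions and is not the SALSA scoring matrix: the predicted residues are assumed to be already mapped to columns of a structure alignment, the 50% membership threshold (min_fraction) is an arbitrary choice, the match fraction is a hypothetical stand-in for the actual score, and all protein names and column numbers are invented.

```python
from collections import Counter


def consensus_signature(predicted_columns_by_protein, min_fraction=0.5):
    """Alignment columns predicted as active site in at least min_fraction of members."""
    n = len(predicted_columns_by_protein)
    if n == 0:
        raise ValueError("at least one characterized protein is required")
    counts = Counter(
        col for cols in predicted_columns_by_protein.values() for col in set(cols)
    )
    return {col for col, count in counts.items() if count / n >= min_fraction}


def match_fraction(query_columns, signature):
    """Fraction of consensus columns that are also predicted in the query protein."""
    if not signature:
        return 0.0
    return len(set(query_columns) & signature) / len(signature)


# Hypothetical subclass: predicted active site residues given as alignment columns.
subclass = {
    "protA": {12, 45, 78, 101},
    "protB": {12, 45, 80, 101},
    "protC": {12, 45, 78},
}
signature = consensus_signature(subclass)
query = {12, 45, 101, 130}  # hypothetical SG protein predictions
print(sorted(signature))
print(match_fraction(query, signature))
```

In this toy case, column 80 is predicted in only one of three members and is left out of the signature, and the query matches three of the four consensus columns.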

3.4.1 Organizing the superfamily

The PASS2 Database38 and the SCOP database19 are classification databases that provide detailed information on the structural and evolutionary relationships between all proteins whose structure is known. The information collected includes the number of members in a superfamily, protein names and PDB IDs. Using the web server DALI24, one can obtain a list of structurally similar protein structures, some of which may not have been listed in the PASS2 or SCOP databases. The PDB page contains the reference for the solved structure, if such a reference is available, and often these references identify important active site residues. This information is used to further validate the POOL predictions. An alternative to scouring the literature is the Catalytic Site Atlas33 (CSA). This database is used to identify catalytic residues in a query protein by searching for homologues that have been manually curated and that have literature references containing evidence that specific residues participate in catalysis. Not all proteins are found in the CSA, so scouring the literature is still necessary. Issues arise in the search for structures because of incorrect and inconsistent nomenclature, experimental literature that is unavailable in the PDB, high sequence identity and repetitive structures (higher or lower resolution and different bound ligands). For present purposes, the data set for a superfamily is chosen to be highly diverse by selecting from many species and seeking the lowest possible sequence homology, when available. Known proteins with missing regions in the 3D crystal structure are common. For these cases, homology modeling is used to build in the missing regions.

3.4.2 Homology Modeling

Because the crystallization methods for structural genomics proteins vary as a result of high-throughput approaches, portions of the structures are often incomplete. The background of homology modeling was discussed in Chapter 1. The program YASARA39 was used to build homology models for proteins that contained missing regions or that belong to a well-characterized subclass with limited structures available. Templates for the models were chosen to exclude structures from the same subclass to avoid bias.

3.4.3 Using functional site predictor POOL in SALSA

Once all proteins are collected, the available 3D structures are submitted to a functional site predictor to gain information about active site residues. The available functional site predictors were discussed in Chapter 1. In SALSA, the active site predictor program POOL is used. The POOL program incorporates various inputs, namely electrostatics data (from THEMATICS40), pocket information (from ConCavity41) and phylogenetic scores (from INTREPID42), to predict active site residues. Using multiple inputs, and therefore more information, provides a better prediction of active site residues, as shown by Dr. Srinivas Somarowthu.7 Each of these inputs can be selected in the POOL submission process. For example, to include the input from INTREPID, an output file from that server is required. During the development of SALSA, the INTREPID server went offline and is currently unavailable. The input from ConCavity is built in to the POOL calculations, so no additional files need to be uploaded. To run POOL calculations, the ./POOLTIC.sh script is used because it includes all the input features. The output files are generated with THEMATICS, ConCavity and/or INTREPID information. The rank order of predicted residues appears in list format (as .txt files) in four files. The output file incorporating all input types is labeled .TICranks. The ranked residues are kept in an “output rank file,” a .txt file. The cut-off value for predicted POOL residues is set at the top 8-10% of ranked residues.

Table 3.2 Names of the POOL output files using the POOL program.

POOL program | Output file generated | POOL ranks with input(s)
./POOLTIC.sh | .bm4ranks | THEMATICS
./POOLTIC.sh | .tcranks | THEMATICS, ConCavity
./POOLTIC.sh | .ibm4ranks | THEMATICS, INTREPID
./POOLTIC.sh | .TICranks | THEMATICS, INTREPID, ConCavity
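Applying the 8-10% cut-off to a ranked residue list can be illustrated with the following Python sketch. It assumes the ranked list has already been read into memory, best rank first; the format of the POOL output files is not assumed here. The function name, the residue labels and the choice to round the number of kept residues up are all hypothetical.

```python
import math


def top_ranked(ranked_residues, cutoff_percent=8):
    """Keep the top cutoff_percent of a rank-ordered residue list (rounded up)."""
    if not 0 < cutoff_percent <= 100:
        raise ValueError("cutoff_percent must be in the range (0, 100]")
    n_keep = math.ceil(len(ranked_residues) * cutoff_percent / 100)
    return ranked_residues[:n_keep]


# Hypothetical 250-residue protein with placeholder labels, best rank first.
ranked = [f"RES{i}" for i in range(1, 251)]
print(len(top_ranked(ranked, 8)))   # residues kept at the 8% cut-off
print(len(top_ranked(ranked, 10)))  # residues kept at the 10% cut-off
```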

3.4.4 Multiple structure alignments (MSA) of subclasses

The use of multiple structure alignment tools is important for the analysis of local active sites. Alignment methods vary and produce different results; the best approach is to find a program that focuses on the local spatial region of the active site. A variety of structural alignment algorithms and servers, such as PDBeFold43, TopoFit25, CE-MC44 and Tree-based Consistency Objective Function for alignment Evaluation (T-COFFEE) Espresso45, are available to perform pairwise and multiple structural alignment of proteins. PDBeFold is a web-based server that performs structure alignment based on the identification of residues occupying “equivalent” geometrical positions. The results can be sorted by Q-score (quality of Cα alignment, with 1 being the highest), P-score (taking into account RMSD, number of aligned residues, number of gaps, number of matched secondary structure elements (SSEs) and the SSE match score), Z-score (based on statistics), RMSD (root mean square deviation) and percent sequence identity.43 The output file is a residue-by-residue mapping, which can be analyzed to find similarity in regions of the proteins. The advantage of PDBeFold43 is that the output files (fasta.aln) are provided in a format compatible with structure analysis and comparison programs. The main disadvantage of the program is that it will crash when a large number of dissimilar protein structures are submitted. It was observed that smaller numbers of proteins and more structurally similar proteins align without problems. Also, the input can only be a PDB structure; homology models or in-house structures cannot be aligned. TopoFit25 is a web server that uses Delaunay tessellation to identify the largest group of residues that have the same neighbors in the same locations in both compared structures. The output of TopoFit includes a list of structurally similar proteins with information regarding the lengths of the structural alignment, the RMSD between structures, a Z-score and the percent sequence identity of the alignments.46 The advantage of TopoFit is its approach of identifying regions of similarity in the protein without relying on sequence information. However, the wait time for results is long and the output files are not easily compatible with structure analysis and comparison programs. The program will also crash, as seen with PDBeFold. The Combinatorial Extension and Monte Carlo (CE-MC) server is an expansion of CE23, which performs pairwise alignments. The CE-MC server program is based on two independent algorithms, combinatorial extension and Monte Carlo optimization. The CE algorithm performs an all-against-all pairwise alignment of the submitted protein structures, and the end result is a Z-score from these alignments. The alignments are used to generate a guide tree (clusters of alignments), which is optimized using Monte Carlo (MC) simulation.44 CE-MC has the advantage of quick calculations, but it crashes in the same way as PDBeFold and TopoFit. T-COFFEE Espresso45 is a server that takes PDB IDs as input and performs both pairwise sequence alignments and pairwise structure comparisons to generate multiple sequence alignments. With this information, the MSA program is able to align better between proteins that are not structurally similar (>40% identity).45 The advantages are that the multiple structures can be from any source and that the program does not crash on structures that have low structural similarity. The output file is compatible with structure analysis and comparison programs.

For the SALSA method, the MSA algorithm needs to handle multiple submissions and to align non-homologous 3D structures of proteins. Each of these methods was tested to determine which produced the best alignments. Both PDBeFold and T-COFFEE Espresso met the requirements for this step. Initially, the MSA process was performed manually (Figure 3.3) to create an overall structure alignment of the superfamily. The process is as follows: 1) create an alignment of the proteins within each subclass; 2) select a “representative” structure from each subclass, the structure with the best resolution and known experimental data; 3) create an overall alignment of all subclass “representatives”; and 4) use structure analysis and comparison programs to manually curate the alignments at the local active site region using POOL information.
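Step 2 of this process, the choice of a representative per subclass, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Structure` record, its field names, and the rule "lowest resolution value among structures that have experimental data" are hypothetical choices made here, since the text does not specify how the two criteria are combined.

```python
from dataclasses import dataclass

@dataclass
class Structure:
    # Hypothetical record; field names are assumptions for this sketch.
    pdb_id: str
    subclass: str
    resolution: float            # in Angstroms; a lower value is better
    has_experimental_data: bool  # e.g. known function from the literature

def pick_representatives(structures):
    """Return {subclass: pdb_id} of the best-resolved annotated structure."""
    best = {}
    for s in structures:
        if not s.has_experimental_data:
            continue  # assumed: representatives must have experimental data
        current = best.get(s.subclass)
        if current is None or s.resolution < current.resolution:
            best[s.subclass] = s
    return {subclass: s.pdb_id for subclass, s in best.items()}
```

Subclasses with no experimentally characterized member simply do not appear in the result, which flags them for manual attention.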

Figure 3.3 Overview of the process for creating a multiple structure alignment of the superfamily. Alignments within a subclass are performed using TCOFFEE. A representative is selected from each subgroup and aligned with TCOFFEE. Manual evaluation is performed to check for any issues due to non-homologous proteins.

The next step in the SALSA method is to use structural comparison programs to manually curate alignments at the local active site region using POOL information. This process was slow and tedious. An automated program, Generating Local Alignment Table (GLAT), was created to speed up this step (Figure 3.4). GLAT was developed by Dr. Pengcheng Yin; the full background can be found in Dr. Yin’s dissertation.47 Protein structures were aligned using the PDBeFold server.

Figure 3.4 Steps for the automated program Generating Local Alignment Table (GLAT).

Combining this with the output POOL rank file, a local alignment table is generated. This is similar to the manually curated one discussed above. The original multiple structure alignment generated manually was compared to the one generated by GLAT. The two alignments are similar, except that GLAT missed a few residues that aligned between subclasses. Visual inspection is needed to search for gaps or mismatches in the alignment, which is often the case when the number of aligned proteins is large.

3.4.5 Molecular visualization of protein structures and sequences

To gain a better understanding of multiple structure alignments, SALSA utilizes two programs for visualizing structures and running analyses: the UCSF Chimera package48 and YASARA49. Chimera is developed by the Resource for Biocomputing, Visualization, and Informatics at the University of California, San Francisco. Images created with YASARA (www.yasara.org) are rendered with the program POVRay (www.povray.org). Each program has its own advantages in the file types it reads and in the tools and calculations it makes available. Chimera is free for academic institutions, while YASARA requires the purchase of an academic license.

Important tools provided from these programs include:

1. Structure analysis and comparison: visualizing the 3D alignment and 2D sequence alignment

2. Measure distances between atoms for RMSD calculations

3. Create labels with text, symbols, and arrows in 2D, set background color, gradient, or image, ribbon or amino acid residue representation, various color options for multiple protein structures

4. Read local files: FASTA Alignment, text, PDB, python and sdf and export compatible file formats

5. Save high-quality images for publication

3.4.6 Generating a SALSA table

The multiple structure alignment table is cleaned up after visual inspection to create a first-draft SALSA table. The rows represent individual protein structures and the columns represent spatially aligned positions (Figure 3.5). The POOL-predicted residues are shown in uppercase, and lowercase letters represent aligned residues that are not POOL predicted. The SALSA table is then used to generate consensus signatures for each subclass.

Subclass  PDB ID  1     2     3     4     5     6
1         PDB A   E53   l87   E163  s165  S215  g236
1         PDB B   E47   L81   E157  h159  S210  G231
1         PDB D   E51   l83   E159  n161  S211  g233
2         PDB E   Y10   G38   l139  a141  F198  G199
2         PDB F   F22   g51   i153  p155  F212  G213
2         PDB G   F22   G51   i153  p155  F212  g213
3         PDB H   a6    v50   s125  D127  g195  v222
3         PDB I   a9    v52   g128  D130  g197  g222
3         PDB J   A9    V52   G128  D130  g201  g226
4         PDB K   C9    l50   a128  D130  g202  a224
4         PDB L   C10   l53   a131  D133  g205  A227
4         PDB M   C243  l297  S402  D404  S500  A523

Figure 3.5 A hypothetical example of a generated SALSA table. The PDB code will be different for each sequence. The spatially aligned amino acid residues are listed for each structure, identified by its PDB ID.
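The upper/lowercase convention of the SALSA table can be illustrated with a minimal Python sketch. The function names and the input layout (a list of residue type and number pairs plus a set of POOL-predicted residue numbers) are assumptions made for this example and are not the GLAT file format.

```python
def salsa_cell(residue_type, residue_number, pool_predicted):
    """Format one table cell: uppercase if POOL predicted, else lowercase."""
    letter = residue_type.upper() if pool_predicted else residue_type.lower()
    return f"{letter}{residue_number}"

def salsa_row(aligned_residues, pool_predicted_numbers):
    """Build one table row from (one-letter type, residue number) pairs.

    aligned_residues: residues in spatially aligned column order.
    pool_predicted_numbers: residue numbers predicted by POOL for this protein.
    """
    return [
        salsa_cell(res_type, number, number in pool_predicted_numbers)
        for res_type, number in aligned_residues
    ]

# Hypothetical input chosen to mirror the "PDB A" row of Figure 3.5.
row = salsa_row(
    [("E", 53), ("L", 87), ("E", 163), ("S", 165), ("S", 215), ("G", 236)],
    pool_predicted_numbers={53, 163, 215},
)
print(row)
```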

3.4.7 Generating consensus signatures (CS)

With the SALSA table created, aligned active site residues of common amino acid type between the proteins in a functional subclass are identified and called the consensus signature for that subclass. The consensus signature for a given biochemical function thus consists of a series of amino acid types in specified spatial positions. Using information from experimentally confirmed active site residues and POOL predictions, the characteristic active site residues were identified in each subclass. As a benchmark to check this process, the CSA was used to test the number of residues POOL predicted as active site residues. POOL performs very well in identifying the correct residues.7 Another advantage of POOL is that it identifies residues that are not listed in the CSA. The literature contains information based on mutagenesis studies, but not all residues have been tested experimentally. The CSA lists only a few residues, as literature annotation of most catalytic residues is not complete. If two structurally similar proteins contain a conserved region of active site residues, it is difficult to determine whether they are likely to have the same biochemical function if one has only two or three residues to compare.

Position number - Aligned spatial position
Subclass    PDB ID  1     2     3     4     5     6
Subclass 1  PDB A   E53   l87   E163  s165  S215  g236
Subclass 1  PDB B   E47   L81   E157  h159  S210  G231
Subclass 1  PDB D   E51   l83   E159  n161  S211  g233
Subclass 2  PDB E   Y10   G38   l139  a141  F198  G199
Subclass 2  PDB F   F22   g51   i153  p155  F212  G213
Subclass 2  PDB G   F22   G51   i153  p155  F212  g213
Subclass 3  PDB H   a6    H48   s125  D127  g195  v222
Subclass 3  PDB I   a9    H50   g128  D130  g197  g222
Subclass 3  PDB J   A9    H50   G128  D130  g201  g226
Subclass 4  PDB K   C9    v48   a128  D130  g202  a224
Subclass 4  PDB L   C10   a51   a131  D133  g205  A227
Subclass 4  PDB M   C243  t295  S402  D404  S500  A523

Figure 3.6 A hypothetical example of consensus signatures. The residue in red is a CSA validated residue. Residues in uppercase are POOL predicted. Residues in BOLD are the consensus signature residues for each subgroup.

POOL provides more information on predicted residues with which to sort the superfamily. If more than one protein structure (row) has a POOL prediction of an important residue in a given spatial position (column) in the alignment, then that position is considered to be important in the SALSA analysis (Figure 3.6). In the example SALSA table, subclasses 3 and 4 differ in their residues at spatially aligned positions 1 and 2 but share a common CSA residue at position 4. Relying only on the CSA would find one shared position between subclasses 3 and 4, but POOL provides two additional positions with different residues. Consensus signatures can be represented in a final table (Figure 3.7). The consensus signatures can be reported as the residue type in each spatial position. Residue numbers, corresponding to the PDB ID, can be added to the table.

Position number - Aligned spatial position
Subclass    PDB ID  1     2     3     4     5
Subclass 1  PDB A   E     -     E     -     S
Subclass 2  PDB E   Y/F   -     -     -     F
Subclass 3  PDB H   -     H     -     D     -
Subclass 4  PDB K   C     -     -     D     -

Figure 3.7 A hypothetical example table of consensus signatures.
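The step from a SALSA table to consensus signatures can be sketched as follows. This is a minimal illustration, not the procedure used in the dissertation: the function name, the cell format (`'E53'`, `'l87'`), and the `min_fraction` parameter are assumptions. With `min_fraction=1.0` a position is kept only when every row of the subclass is POOL predicted there, which happens to reproduce the hypothetical entries of Figure 3.7; lowering it moves toward the looser "more than one row" rule stated in the text.

```python
def consensus_signature(rows, min_fraction=1.0):
    """Derive a consensus signature for one subclass.

    rows: list of rows, each a list of cells such as 'E53' (uppercase first
          letter = POOL predicted) or 'l87' (lowercase = not POOL predicted).
    Returns {position: residue types}, with positions numbered from 1.
    """
    n_rows = len(rows)
    signature = {}
    for position, column in enumerate(zip(*rows), start=1):
        predicted = [cell[0] for cell in column if cell[0].isupper()]
        # The text requires more than one POOL-predicted row; min_fraction
        # is an additional, assumed strictness knob.
        if len(predicted) >= 2 and len(predicted) >= min_fraction * n_rows:
            types = sorted(set(predicted), key=predicted.index)
            signature[position] = "/".join(types)
    return signature

# Rows copied from the hypothetical subclass 2 of Figure 3.6.
subclass_2 = [
    ["Y10", "G38", "l139", "a141", "F198", "G199"],
    ["F22", "g51", "i153", "p155", "F212", "G213"],
    ["F22", "G51", "i153", "p155", "F212", "g213"],
]
print(consensus_signature(subclass_2))
```

Calling `consensus_signature(subclass_2, min_fraction=0.0)` additionally reports the glycines at positions 2 and 6, which shows how sensitive the signature is to the chosen threshold.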

3.4.8 Analysis of structural genomics proteins

Structural genomics (SG) proteins in the superfamily can be identified using a procedure similar to that used to generate the data set. One can use keyword searching in the PDB or DALI structure searching; SG proteins can sometimes also be found in the PASS2 database. Once the SG proteins in the superfamily are identified, POOL calculations are run on the 3D structures and the predicted residues are then structurally aligned to each subclass using GLAT. The alignments are inspected for any gaps or misalignments. To quantify a match, a scoring method, the Normalized Match to Consensus Signatures (MCS), was developed.

3.4.9 Scoring the SALSA method using Normalized Match to Consensus Signatures (MCS)

For the final step, SALSA needs a scoring method to measure the quality of the match between the predicted functional site in a query protein and the functional sites in proteins of known function. The automated program used in SALSA is Generating Scoring Matrices (GSM); the full background can be found in Dr. Yin’s dissertation.47 GSM uses the Normalized Match to Consensus Signatures (MCS) method. MCS calculates scores between pairs of local structures from the generated alignment table (GLAT), or between a predicted local active site and the consensus signature of a functional subclass. Scoring is based upon residue similarity matrices.

There are a variety of scoring matrices that take into account chemical similarity or the evolutionary probability of substitution, such as point accepted mutation (PAM)50, BLOcks SUbstitution Matrix (BLOSUM)51 or an in-house Chemical Similarity Matrix (CSM). These substitution matrices describe the rate at which one amino acid residue changes into another over time. Each of the twenty amino acids is assigned a probability of mutating into each of the 19 other amino acid types. The main question to ask is: does the change in the amino acid conserve the physical and chemical properties, or will it disrupt structural and functional features of the protein? A 20 by 20 matrix is constructed, with each element representing the probability of a given amino acid transforming into a specified amino acid. Using this similarity matrix, one can describe quantitatively the similarity between the residues in two aligned lists of amino acids.

The first matrix, PAM, was one of the first substitution matrices, developed by Dayhoff in the 1970s using the idea of point accepted mutation. The matrix estimates the rate of substitution that would be expected if 1% of the amino acids had changed in closely related proteins (Figure 3.8). The matrix arranges similar amino acids near each other, and the scores along the diagonal are positive. Positive scores indicate conservative substitutions, that is, amino acids that replace each other frequently. A negative score indicates infrequent replacement of certain pairs of amino acids. The second matrix, BLOSUM, used the BLOCKS database to identify conserved regions of protein families and derived the relative frequencies of amino acids and their substitution probabilities. A log-odds score was calculated for each of the 210 possible substitution pairs of the 20 amino acids (Figure 3.9).
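As a short worked note on the count and on the general form of a log-odds score: the 20 amino acids give 190 distinct unordered pairs plus 20 identical pairs, which is the 210 quoted above. The symbols in the second expression are notation introduced here for illustration and do not appear in the text: q_ij is the observed frequency of the aligned pair (i, j), p_i and p_j are the background frequencies of the two residues, and λ is a scaling factor chosen so that scores can be rounded to integers.

```latex
\binom{20}{2} + 20 = 190 + 20 = 210,
\qquad
s_{ij} = \frac{1}{\lambda}\,\log\frac{q_{ij}}{p_i\,p_j}
```

A positive s_ij means the pair is observed more often than expected by chance, and a negative value means it is observed less often, which matches the reading of positive and negative entries given above.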

C   12
G   -3   5
P   -3  -1   6
S    0   1   1   1
A   -2   1   1   1   2
T   -2   0   0   1   1   3
D   -5   1  -1   0   0   0   4
E   -5   0  -1   0   0   0   3   4
N   -4   0  -1   1   0   0   2   1   2
Q   -5  -1   0  -1   0  -1   2   2   1   4
H   -3  -2   0  -1  -1  -1   1   1   2   3   6
K   -5  -2  -1   0  -1   0   0   0   1   1   0   5
R   -4  -3   0   0  -2  -1  -1  -1   0   1   2   3   6
V   -2  -1  -1  -1   0   0  -2  -2  -2  -2  -2  -2  -2   4
M   -5  -3  -2  -2  -1  -1  -3  -2   0  -1  -2   0   0   2   6
I   -2  -3  -2  -1  -1   0  -2  -2  -2  -2  -2  -2  -2   4   2   5
L   -6  -4  -3  -3  -2  -2  -4  -3  -3  -2  -2  -3  -3   2   4   2   6
F   -4  -5  -5  -3  -4  -3  -6  -5  -4  -5  -2  -5  -4  -1   0   1   2   9
Y    0  -5  -5  -3  -3  -3  -4  -4  -2  -4   0  -4  -5  -2  -2  -1  -1   7  10
W   -8  -7  -6  -2  -6  -5  -7  -7  -4  -5  -3  -3   2  -6  -4  -5  -2   0   0  17
     C   G   P   S   A   T   D   E   N   Q   H   K   R   V   M   I   L   F   Y   W

Figure 3.8 PAM250 Matrix.50

A    4
R   -1   5
N   -2   0   6
D   -2  -2   1   6
C    0  -3  -3  -3   9
Q   -1   1   0   0  -3   5
E   -1   0   0   2  -4   2   5
G    0  -2   0  -1  -3  -2  -2   6
H   -2   0   1  -1  -3   0   0  -2   8
I   -1  -3  -3  -3  -1  -3  -3  -4  -3   4
L   -1  -2  -3  -4  -1  -2  -3  -4  -3   2   4
K   -1   2   0  -1  -3   1   1  -2  -1  -3  -2   5
M   -1  -1  -2  -3  -1   0  -2  -3  -2   1   2  -1   5
F   -2  -3  -3  -3  -2  -3  -3  -3  -1   0   0  -3   0   6
P   -1  -2  -2  -1  -3  -1  -1  -2  -2  -3  -3  -1  -2  -4   7
S    1  -1   1   0  -1   0   0   0  -1  -2  -2   0  -1  -2  -1   4
T    0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   1   5
W   -3  -3  -4  -4  -2  -2  -3  -2  -2  -3  -2  -3  -1   1  -4  -3  -2  11
Y   -2  -2  -2  -3  -2  -1  -2  -3   2  -1  -1  -2  -1   3  -3  -2  -2   2   7
V    0  -3  -3  -3  -1  -2  -2  -3  -3   3   1  -2   1  -1  -2  -2   0  -3  -1   4
     A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V

Figure 3.9 BLOSUM62 Matrix.51

The Chemical Similarity Matrix (CSM) is an in-house matrix developed and optimized by Professor Mary Jo Ondrechen’s group that scores the similarity of residues based on their chemical properties (Figure 3.10). The best-performing similarity matrix for SALSA was BLOSUM62, which was implemented in the MCS. MCS measures the proximity of the predicted active site of the query SG protein to those of proteins of known function or to the consensus signature. The score uses residue similarity values from BLOSUM62 at the spatially aligned positions between the SG protein and the consensus signatures (Eq. 3.1).

A    6
R    0   6
N    0  -1   6
D    0  -1   0   6
C    0   0   0   0   6
Q    0  -1   4   0   0   6
E    0  -1   0   4   0   0   6
G    1   0   0   0   0   0   0   6
H    0   4  -1  -2   0  -1  -2   0   6
I    0  -2  -2   0   0  -2   0   0  -2   6
L    0  -2  -2   0   0  -2   0   0  -2   4   6
K    0   4  -1  -2   0  -1  -2   0   3  -2  -2   6
M    0  -2  -2   0   0  -2   0   0  -2   2   2  -2   6
F    0  -2  -2  -2   0  -2  -2   0  -2   0   0  -2   0   6
P    0   0   0   0   0   0   0   0   0   0   0   0   0   0   6
S    0  -2  -2  -2   0  -2  -2   0  -2  -2  -2  -2  -2  -2   0   6
T    0  -2  -2  -2   0  -2  -2   0  -2  -2  -2  -2  -2  -2   0   4   6
W    0  -2  -2  -2   0  -2  -2   0  -2   0   0  -2  -2   3   0  -2  -2   6
Y    0  -2  -2  -2   0  -2  -2   0  -2   0   0  -2  -2   4   0  -2  -2   3   6
V    0  -2  -2   0   0  -2   0   0  -2   3   3  -2  -2   0   0  -2  -2   0   0   6
     A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V

Figure 3.10 Chemical similarity matrix prepared by the ORG.

A normalized match to consensus signature score (MCS) is assigned to each query protein for each of the functional subgroups and the calculated score is divided by the score for the consensus signature aligned to itself (Eq. 3.2).

$$SS = \sum_{i=1}^{NAP} RS_i \qquad \text{(Eq. 3.1)}$$

$$MCS = \frac{SS_{QP,CS}}{SS_{CS,CS}} \qquad \text{(Eq. 3.2)}$$

SS: the similarity score for each pair of proteins

RS_i: the residue score for aligned position i from the scoring matrix

NAP: the number of aligned positions

SS_{QP,CS}: the similarity score between the query protein (QP) and a consensus signature (CS)

SS_{CS,CS}: the score of a perfect match to a consensus signature

MCS: match-to-consensus signature score for the query protein

Figure 3.11 Overview of Generating Scoring Matrices (GSM) program. GSM is an automated procedure that assigns the score to a matrix for each of the proteins. This is covered in Dr. Yin’s thesis.47

The MCS scores provide a measure of the closeness of the query SG protein to the consensus signatures for each functional subclass. The higher the score is, the more similar the two corresponding proteins are at the predicted signature residues. In other words, these scores are a measure of the closeness between the local regions around the active sites of the two proteins.
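Equations 3.1 and 3.2 can be made concrete with a small Python sketch. It is an illustration under stated assumptions rather than the GSM program: the function names are invented here, the dictionary holds only a handful of BLOSUM62 entries copied from Figure 3.9, and both inputs are assumed to be gap-free strings of equal length with one residue type per aligned position.

```python
# Excerpt of BLOSUM62 (Figure 3.9); keys are alphabetically sorted pairs.
BLOSUM62_EXCERPT = {
    ("E", "E"): 5,
    ("D", "E"): 2,
    ("D", "D"): 6,
    ("S", "S"): 4,
    ("S", "T"): 1,
}

def residue_score(a, b, matrix):
    """RS for one aligned position, looked up symmetrically."""
    return matrix[tuple(sorted((a, b)))]

def similarity_score(seq_a, seq_b, matrix):
    """SS of Eq. 3.1: the sum of residue scores over all aligned positions."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned residue lists must have equal length")
    return sum(residue_score(a, b, matrix) for a, b in zip(seq_a, seq_b))

def mcs(query, consensus, matrix):
    """MCS of Eq. 3.2: SS(QP, CS) divided by SS(CS, CS)."""
    return (similarity_score(query, consensus, matrix)
            / similarity_score(consensus, consensus, matrix))

# Hypothetical query residues E, D, S against the subclass 1 signature E, E, S.
print(mcs("EDS", "EES", BLOSUM62_EXCERPT))  # (5 + 2 + 4) / (5 + 5 + 4)
```

A query that reproduces the signature exactly scores 1.0, so the normalization makes scores comparable across subclasses whose signatures differ in length and composition.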

3.5 The development of an automated SALSA analysis: SALSA-DT

To organize the SALSA method, various bioinformatics tools and servers were tested and added throughout the method development process. No single high-throughput approach performed the full task of SALSA. Programs developed in-house, GLAT and GSM, automated parts of the SALSA process but still left much work to be done manually. From the preliminary applications of the SALSA methodology, the need for an automated method became clear.

Through a collaboration with members of the College of Computer and Information Science (CCIS) and the Department of Mathematics, an alternative approach to speeding up multiple structure alignments grew into SALSA-Delaunay Triangulation (SALSA-DT). The methods discussed previously were implemented in SALSA-DT. This work was performed together with Rohan Garg (CCIS), Liang Tian (Math), Jiajun Cao (CCIS), Professor Gene Cooperman (CCIS) and Professor Alexandru Suciu (Math).

3.5.1 Automating the SALSA analysis

The current bottleneck in the SALSA method is performing an accurate structural alignment between members of the functional subclasses. The proposed solution is to focus on aligning the local active site by creating a unique 3D spatial representation of the active site that can be matched. The general approach is to take the set of highly ranked residues from the POOL prediction and represent them as unique points in 3D space to form a graph representation of the active site. The graph representations can then be matched rapidly to other active sites. Current structure alignment methods (Section 3.4.4) represent the entire structure in Cartesian coordinates and carry out global and local alignments of full-length sequences, taking residue similarity into account, and some algorithms use clustering methods to match the sequences. Since the active site is identified using POOL, matching only the active site residues is advantageous. One method for generating spatial representations is the graph theory approach of Delaunay Triangulation (DT). DT generates topological graph descriptors of the active site residues.

3.5.2 Background of graphical representation

A graph is a model of a set of points in 3D space with line segments connecting the points. Mathematically, a graph consists of two finite sets V and E (Eq. 3.3). The elements of V are called vertices (or nodes) and the elements of E are called edges. Each edge is associated with a set of one or two vertices called its endpoints. An edge joins its endpoints.52

$$G = (V, E) \qquad \text{(Eq. 3.3)}$$

Figure 3.12 Line drawings of Graph A.

The vertex and edge sets of the simple graph A (Figure 3.12) are given as:

$$V_A = \{p, q, r, s\}$$

$$E_A = \{pq, pr, ps, rs, qs\}$$

A formal specification lists a row for each vertex containing the list of neighbors of that vertex.

p : q r s

q : p s

r : p s

s : p q r
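The neighbor listing above follows mechanically from the vertex and edge sets of Graph A. A minimal Python sketch, with variable and function names chosen here for illustration:

```python
# Vertex and edge sets of Graph A (Figure 3.12).
V_A = ["p", "q", "r", "s"]
E_A = ["pq", "pr", "ps", "rs", "qs"]

def adjacency_list(vertices, edges):
    """Map each vertex to the sorted list of its neighbors."""
    neighbors = {v: set() for v in vertices}
    for first, second in edges:  # each edge is a two-letter string
        neighbors[first].add(second)
        neighbors[second].add(first)
    return {v: sorted(neighbors[v]) for v in vertices}

for vertex, adjacent in adjacency_list(V_A, E_A).items():
    print(vertex, ":", " ".join(adjacent))
```

Each edge contributes to two rows, one per endpoint, so the total number of neighbor entries is twice the number of edges.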

Geometric drawing of graphs uses exact coordinates of an image of the vertices and edges of a graph which includes the lengths and area formed; in three dimensions these edges form tetrahedra.

A 3D triangulation starts with vertices in 3D space, and adds edges so that every vertex is part of at least one tetrahedron. These types of graphs are useful in the applications of pattern recognition and identifying networks.53

3.5.3 Application of Delaunay Triangulation and graph representations to SALSA

In a PDB file, each amino acid residue is represented by the (x,y,z) coordinates of all of its heavy atoms; the (x,y,z) coordinates of the alpha carbon atoms are typically used if one wishes to specify each residue with a single point in space. The coordinates of the alpha carbon atoms constitute the input to the Delaunay Triangulation. Additional residue-specific chemical properties can also be specified, increasing the dimension to four or five or higher. Delaunay Triangulations convert a geometric description of the protein into a topological graph-like description.52 The geometric positions of the residues of the protein are used to produce a graph in which each vertex corresponds to a residue (Figure 3.13).

An edge between two vertices in the graph tends to indicate that the corresponding residues in the protein are near each other. The defining property of a Delaunay Triangulation is that, for each tetrahedron in the triangulation, the circumscribed sphere passing through the vertices of that tetrahedron contains no additional vertices in its interior. Mathematically, this is the sense in which, if there is an edge between two vertices (alpha carbons), those vertices tend to be near each other.

Added weights (for distance between residues and chemical similarity of residues) can be used to emphasize the importance of individual vertices in the triangulation. POOL predicted residues will have a higher weight applied to the Delaunay vertices corresponding to those residues.

Once tetrahedra (containing the highest ranked POOL residues) are prepared for each protein, they can be compared to one another.
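The empty-circumsphere property that characterizes a Delaunay tetrahedron can be checked directly. The sketch below is not a triangulation algorithm (that job is done by a program such as Qhull); it only tests one given tetrahedron against other points, uses plain Python with Cramer's rule, and all function names are chosen here for illustration.

```python
import math

def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def circumsphere(a, b, c, d):
    """Center and radius of the sphere through four non-coplanar points."""
    # |x - p|^2 = |x - a|^2 for p in (b, c, d) gives a 3x3 linear system.
    rows = [[2.0 * (p[k] - a[k]) for k in range(3)] for p in (b, c, d)]
    rhs = [sum(p[k] ** 2 - a[k] ** 2 for k in range(3)) for p in (b, c, d)]
    denom = det3(rows)
    if abs(denom) < 1e-12:
        raise ValueError("degenerate (flat) tetrahedron")
    center = []
    for col in range(3):
        m = [row[:] for row in rows]
        for r in range(3):
            m[r][col] = rhs[r]  # Cramer's rule: swap in the right-hand side
        center.append(det3(m) / denom)
    return tuple(center), math.dist(center, a)

def satisfies_delaunay(tetrahedron, other_points, eps=1e-9):
    """True if no other point lies strictly inside the circumsphere."""
    center, radius = circumsphere(*tetrahedron)
    return all(math.dist(center, p) >= radius - eps for p in other_points)
```

For the tetrahedron with corners (0,0,0), (1,0,0), (0,1,0) and (0,0,1) the circumcenter is (0.5, 0.5, 0.5); a fifth point at (2,2,2) leaves the property intact, whereas a point at the circumcenter would violate it.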

Figure 3.13 Simplified example of Delaunay Triangulation for a 2D protein.

3.5.4 Implementation of Delaunay Triangulation

The SALSA-DT algorithm has two parts, preprocessing and pairwise matching, with pairwise matching consisting of two steps (Figure 3.14).

Figure 3.14 Overview of SALSA-DT for matching pairs of proteins.

The preprocessing step performs the Delaunay Triangulation on the protein structures, resulting in a set of tetrahedra. For a pair of proteins, we next invoke a pairwise matching algorithm. To match two proteins, we need a starting point in each protein. We choose a pair of tetrahedra, one from each protein, as the starting point; these are called seeds. The pairwise matching algorithm begins by identifying seeds between the two proteins’ tetrahedra, discriminating by top POOL-ranked amino acids. It tries all possible pairs of seeds, restricted by considerations of residue similarity. For each pair of seeds, the algorithm extends the match to as many tetrahedra as possible. The pair of seeds that leads to the best match is returned. We discriminate not only by residue similarity but also by POOL-ranked residues and other criteria.
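The overall shape of this seed search can be sketched as a loop over all seed pairs. This is a schematic outline only: `is_compatible` and `extend_match` are hypothetical callables standing in for the filtering criteria and for the extension step, whose actual implementations are not reproduced here.

```python
from itertools import product

def best_seed_match(tetra_p1, tetra_p2, is_compatible, extend_match):
    """Try every seed pair and keep the extension with the highest score.

    tetra_p1, tetra_p2: tetrahedra of the two proteins (any hashable form).
    is_compatible(seed1, seed2) -> bool   (hypothetical filter)
    extend_match(seed1, seed2) -> (match, score)   (hypothetical extension)
    """
    best_match, best_score = None, float("-inf")
    for seed1, seed2 in product(tetra_p1, tetra_p2):
        if not is_compatible(seed1, seed2):
            continue  # seed pair rejected before any extension work is done
        match, score = extend_match(seed1, seed2)
        if score > best_score:
            best_match, best_score = match, score
    return best_match, best_score
```

Because every pair of tetrahedra is considered, the cost grows with the product of the two tetrahedron counts, which is one reason for restricting the tetrahedra to the active site region and for rejecting incompatible seeds early.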

Preprocessing

The algorithm performs Delaunay Triangulation on the entire protein structure (P1) using the Quickhull (Qhull) program.54 The PDB coordinates become a set of tetrahedra (t). A vertex (ri) represents the alpha carbon of an amino acid on the protein’s backbone (Figure 3.15A), where i is the sequence number. Only the tetrahedra in the active site need to be retained for the match (Figure 3.15B). The active site vicinity is determined by the top 50 ranked POOL residues. Any vertices adjacent (ai) to these POOL residues (along with the tetrahedra that contain those residues) are added back into the subset during preprocessing.
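The retention step can be sketched as a filter over the triangulation output. The representation is an assumption made for this example: each tetrahedron is a tuple of four residue sequence numbers and the POOL ranking is a list of residue numbers ordered from highest to lowest rank; the function name is likewise illustrative.

```python
def active_site_subset(tetrahedra, pool_ranking, top_n=50):
    """Keep tetrahedra touching a top-ranked POOL residue.

    Returns (kept tetrahedra, adjacent vertices), where the adjacent
    vertices are those that share a kept tetrahedron with a top-ranked
    residue but are not top-ranked themselves.
    """
    top = set(pool_ranking[:top_n])
    kept = [t for t in tetrahedra if top.intersection(t)]
    adjacent = {vertex for t in kept for vertex in t} - top
    return kept, adjacent

# Toy data: residue numbers are hypothetical.
tetrahedra = [(1, 2, 3, 4), (3, 4, 5, 6), (7, 8, 9, 10)]
kept, adjacent = active_site_subset(tetrahedra, pool_ranking=[1, 5, 9], top_n=2)
print(kept, sorted(adjacent))
```

In the toy call only residues 1 and 5 count as top ranked, so the third tetrahedron is dropped and residues 2, 3, 4 and 6 become the adjacent vertices.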

Next, we describe the Pairwise Matching Algorithm (Figure 3.16) that takes information from the preprocessing to incorporate later in the Delaunay extension portion of the algorithm.

Figure 3.15 An example of Delaunay triangulation for protein structures in the preprocessing step. Panel A shows triangulation of the overall protein structure. In Panel B, the triangulation of the top POOL ranked residues are in blue. This forms the tetrahedra that will be used to represent the active site for alignment.

Figure 3.16 The overview of pairwise matching, described in detail. In the first step of the matching, the top POOL ranked residues will be used to identify seeds by using a seed pair matching initially. The adjacent vertices identified in DT will be used in the second step to increase the matching area.

Pairwise Matching

Two proteins (P1 and P2) may have more than one match. We explored different matches in part by beginning with different seed pairs (pairs of tetrahedra, one from each protein). The lists of the 50 top-ranked POOL residues (ri) from P1 and (r’i) from P2 are used to identify good seed pairs (ri, r’i), where i is any residue number and many of the ri are highly ranked (Figure 3.17).

Figure 3.17 An example of seed pair matching in 2D. Three residues (in yellow) that are top POOL ranked, out of eight residues (in black), are picked out to be matched. The residues are represented as [P1: r1, r2, r8] and [P2: r’1, r’2, r’8], creating a tetrahedron (shaded in blue).

To discriminate among the matches, the chemical similarity of the residues is also tested (Figure 3.18). If the residues are not chemically similar, that seed pair is rejected.

Figure 3.18 Example of seed pair matching by chemical similarity. The chemical type for each residue (amino acid) is compared between P1 and P2. Matching tetrahedra are shaded blue.

In a third level of discrimination (after using POOL ranking and chemical similarity), the distance between the alpha carbons is taken into account. If (ri, rj) is an edge of P1, and if (r’i, r’j) is a matching edge of P2, then we compare the distances dist(ri, rj) and dist(r’i, r’j).

Figure 3.19 The lengths of the edges are included in the seed pair matching. The lengths of the edges for P1, a1,2 and a2,8, are compared to the lengths of the edges for P2, a’1,2 and a’2,8, because they are found on the tetrahedra. If the two distances differ by more than 2 or 3 Å, then the corresponding match is rejected. The matching tetrahedron is shaded in blue.
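The distance criterion amounts to comparing the lengths of two matched edges against a tolerance. In the sketch below the coordinate dictionaries, the function name, and the default tolerance of 3.0 Å (the upper end of the "2 or 3 Å" range) are assumptions for illustration.

```python
import math

def edges_match(coords_p1, coords_p2, edge, tolerance=3.0):
    """Compare dist(ri, rj) in P1 with dist(r'i, r'j) in P2.

    coords_p1, coords_p2: {matched residue label: (x, y, z) of alpha carbon}
    edge: (i, j) pair of labels shared by both proteins after matching.
    """
    i, j = edge
    d1 = math.dist(coords_p1[i], coords_p1[j])
    d2 = math.dist(coords_p2[i], coords_p2[j])
    return abs(d1 - d2) <= tolerance

# Hypothetical alpha carbon coordinates in Angstroms.
p1 = {1: (0.0, 0.0, 0.0), 2: (5.0, 0.0, 0.0), 8: (5.0, 6.0, 0.0)}
p2 = {1: (1.0, 1.0, 1.0), 2: (1.0, 7.0, 1.0), 8: (1.0, 7.0, 11.0)}
print(edges_match(p1, p2, (1, 2)), edges_match(p1, p2, (2, 8)))
```

Only distances are compared, so the test is unaffected by how the two structures are positioned or rotated relative to each other.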

In extending a seed to a match, the Pairwise Matching algorithm uses all three criteria (POOL ranking, chemical similarity and residue distances). Once the seed pairs are identified, the match can be extended further, which is called Delaunay Extension. The seed pairs from the previous step, along with their adjacent vertices, are combined into tetrahedra.

In the implementation, we are matching tetrahedra instead of edges. For this reason, instead of discriminating between regions, we discriminate based on the volume of a tetrahedron and the sum of the edge lengths of a tetrahedron. Finally, one can discriminate between matching tetrahedra by determining whether two tetrahedra have the same orientation.

Currently, the implementation rejects a pair of matching tetrahedra if their volumes differ by more than 14.4 cubic Angstroms; 14.4 is the average volume of the tetrahedra (t) calculated for the superfamily. It also rejects matching tetrahedra if the sums of their edge lengths differ by more than 9.6 Angstroms; 9.6 is the average length of tetrahedra calculated for the superfamily. Specifically, for each pair of tetrahedra ti from P1 and tj from P2, such that ti has ri as one of its vertices and tj has rj as one of its vertices, the pair is rejected if the tetrahedra do not match in the correct orientation, if the volumes of ti and tj differ by more than 14.4, or if the sums of the edge lengths of ti and tj differ by more than 9.6.
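The three tetrahedron-level tests can be sketched with a scalar triple product for the volume. The thresholds are the values quoted above; everything else is an assumption for illustration, in particular the use of the sign of the signed volume, taken in the matched vertex order, as the orientation test, and the function names.

```python
import math
from itertools import combinations

def signed_volume(a, b, c, d):
    """Signed volume of a tetrahedron: (b-a) . ((c-a) x (d-a)) / 6."""
    u = [b[k] - a[k] for k in range(3)]
    v = [c[k] - a[k] for k in range(3)]
    w = [d[k] - a[k] for k in range(3)]
    triple = (u[0] * (v[1] * w[2] - v[2] * w[1])
              - u[1] * (v[0] * w[2] - v[2] * w[0])
              + u[2] * (v[0] * w[1] - v[1] * w[0]))
    return triple / 6.0

def edge_length_sum(tetrahedron):
    """Sum of the six edge lengths of a tetrahedron."""
    return sum(math.dist(p, q) for p, q in combinations(tetrahedron, 2))

def tetrahedra_match(t1, t2, max_volume_diff=14.4, max_length_diff=9.6):
    """Apply the orientation, volume and edge-length-sum tests in turn.

    t1 and t2 list their four vertices in matched order.
    """
    v1, v2 = signed_volume(*t1), signed_volume(*t2)
    if v1 * v2 <= 0:
        return False  # assumed orientation test: signs must agree
    if abs(abs(v1) - abs(v2)) > max_volume_diff:
        return False
    return abs(edge_length_sum(t1) - edge_length_sum(t2)) <= max_length_diff
```

A tetrahedron and its mirror image have identical volume magnitude and edge lengths, so only the orientation test can tell them apart.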

The general list of criteria for the program, set up by Rohan Garg, is given below. For the matched vertex pair (vi, vj) of ti and tj:

1. If vi or vj is among the top 11 POOL-ranked residues in P1 and P2, respectively, and vi is not chemically similar to vj, the pair of tetrahedra is rejected and we move on to the next pair.

2. If vi or vj is among the top 24 POOL-ranked residues in their respective proteins and [POOL rank of vi] – [POOL rank of vj] > 24, the pair of tetrahedra is rejected and we move on to the next pair.

3. If vi or vj is among the top 10 POOL-ranked residues in their respective proteins and [POOL rank of vi] – [POOL rank of vj] > 10, the pair of tetrahedra is rejected and we move on to the next pair.

4. If vi or vj is among the top 3 POOL-ranked residues in their respective proteins and [POOL rank of vi] – [POOL rank of vj] > 1, the pair of tetrahedra is rejected and we move on to the next pair.

The discrimination based on chemical similarity used the chemical matching score derived from the values from the BLOSUM62 matrix. The final match of subgraphs for the two proteins includes matching residues and matching tetrahedra. The criteria described above generated a total match score of all matching residues and matching tetrahedra. Since each seed pair produced a different match, the best matching score among all seed pairs is chosen.
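The four vertex-pair criteria listed above can be written compactly as a rule table. The sketch makes two assumptions that the list does not state: the rank difference is taken as an absolute value, and chemical similarity is supplied as a ready-made boolean (how it is thresholded from BLOSUM62 values is not specified here). The names are illustrative.

```python
# (top_n, max_rank_gap) pairs taken from criteria 2 to 4.
RANK_RULES = [(24, 24), (10, 10), (3, 1)]
CHEMISTRY_TOP_N = 11  # criterion 1

def vertex_pair_accepted(rank_i, rank_j, chemically_similar):
    """Return False if any criterion rejects the matched vertex pair.

    rank_i, rank_j: POOL ranks (1 = highest) of vi in P1 and vj in P2.
    chemically_similar: assumed precomputed similarity decision.
    """
    if min(rank_i, rank_j) <= CHEMISTRY_TOP_N and not chemically_similar:
        return False
    for top_n, max_gap in RANK_RULES:
        # Absolute difference is an assumption of this sketch.
        if min(rank_i, rank_j) <= top_n and abs(rank_i - rank_j) > max_gap:
            return False
    return True

def tetrahedron_pair_accepted(vertex_pairs):
    """A pair of tetrahedra passes only if every matched vertex pair passes.

    vertex_pairs: iterable of (rank_i, rank_j, chemically_similar) tuples.
    """
    return all(vertex_pair_accepted(*pair) for pair in vertex_pairs)
```

The rules tighten as the ranks rise: residues in the top 3 must have nearly identical ranks in the two proteins, while residues ranked between 11 and 24 are only required to lie within 24 ranks of each other.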

To optimize this process, more weights for distance and chemical similarity can be used to emphasize the importance of some vertices in the triangulation. The method is not perfected yet but can be tested on superfamilies of proteins.

3.6 Summary

This chapter introduced the molecular modeling tools used in protein function annotation and gave an overview of the SALSA methodology. In the next chapter, the method is applied to the Ribulose Phosphate Binding Barrel Superfamily. Manual alignment and sorting of the superfamily is successful for the proteins of known function in the superfamily, and results of the graph-based automated method using Delaunay Triangulation are promising. Predictions are also made of the functions of SG members of the superfamily.

References

1. Berman, H. M.; Westbrook, J.; Feng, Z.; Gilliland, G.; Bhat, T. N.; Weissig, H.; Shindyalov,

I. N.; Bourne, P. E., The Protein Data Bank. Nucleic Acids Res 2000, 28 (1), 235-42.

2. Berman, H. M.; Bhat, T. N.; Bourne, P. E.; Feng, Z.; Gilliland, G.; Weissig, H.; Westbrook,

J., The Protein Data Bank and the challenge of structural genomics. Nat Struct Biol 2000, 7

Suppl, 957-9.

3. Terwilliger, T. C.; Stuart, D.; Yokoyama, S., Lessons from structural genomics. Annu Rev

Biophys 2009, 38, 371-83.

4. Service, R. F., Structural biology. Protein structure initiative: phase 3 or phase out. Science

2008, 319 (5870), 1610-3.

5. Protein Structure Initiative (PSI). http://www.nigms.nih.gov/Research/FeaturedPrograms/PSI/.

6. (a) Moult, J.; Melamud, E., From fold to function. Curr Opin Struct Biol 2000, 10 (3), 384-9;

(b) Chothia, C.; Lesk, A. M., The relation between the divergence of sequence and structure in proteins. EMBO J 1986, 5 (4), 823-6.

7. Somarowthu, S.; Yang, H.; Hildebrand, D. G.; Ondrechen, M. J., High-performance prediction of functional residues in proteins with machine learning and computed input features.

Biopolymers 2011, 95 (6), 390-400.

8. Chruszcz, M.; Domagalski, M.; Osinski, T.; Wlodawer, A.; Minor, W., Unmet challenges of structural genomics. Curr Opin Struct Biol 2010, 20 (5), 587-97.

9. Altschul, S. F.; Madden, T. L.; Schaffer, A. A.; Zhang, J.; Zhang, Z.; Miller, W.; Lipman, D.

J., Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.

Nucleic Acids Res 1997, 25 (17), 3389-402.

10. Pearson, W. R., Rapid and sensitive sequence comparison with FASTP and FASTA.

Methods Enzymol 1990, 183, 63-98.

11. Pearson, W. R., Using the FASTA program to search protein and DNA sequence databases. Methods Mol Biol 1994, 24, 307-31.

12. Lichtarge, O.; Bourne, H. R.; Cohen, F. E., An evolutionary trace method defines binding surfaces common to protein families. J Mol Biol 1996, 257 (2), 342-58.

13. Ogiwara, A.; Uchiyama, I.; Seto, Y.; Kanehisa, M., Construction of a dictionary of sequence motifs that characterize groups of related proteins. Protein Eng 1992, 5 (6), 479-88.

14. Sigrist, C. J.; Cerutti, L.; Hulo, N.; Gattiker, A.; Falquet, L.; Pagni, M.; Bairoch, A.; Bucher, P., PROSITE: a documented database using patterns and profiles as motif descriptors. Brief Bioinform 2002, 3 (3), 265-74.

15. Punta, M.; Coggill, P. C.; Eberhardt, R. Y.; Mistry, J.; Tate, J.; Boursnell, C.; Pang, N.; Forslund, K.; Ceric, G.; Clements, J.; Heger, A.; Holm, L.; Sonnhammer, E. L.; Eddy, S. R.; Bateman, A.; Finn, R. D., The Pfam protein families database. Nucleic Acids Res 2012, 40 (Database issue), D290-301.

16. Hunter, S.; Jones, P.; Mitchell, A.; Apweiler, R.; Attwood, T. K.; Bateman, A.; Bernard, T.; Binns, D.; Bork, P.; Burge, S.; de Castro, E.; Coggill, P.; Corbett, M.; Das, U.; Daugherty, L.; Duquenne, L.; Finn, R. D.; Fraser, M.; Gough, J.; Haft, D.; Hulo, N.; Kahn, D.; Kelly, E.; Letunic, I.; Lonsdale, D.; Lopez, R.; Madera, M.; Maslen, J.; McAnulla, C.; McDowall, J.; McMenamin, C.; Mi, H.; Mutowo-Muellenet, P.; Mulder, N.; Natale, D.; Orengo, C.; Pesseat, S.; Punta, M.; Quinn, A. F.; Rivoire, C.; Sangrador-Vegas, A.; Selengut, J. D.; Sigrist, C. J.; Scheremetjew, M.; Tate, J.; Thimmajanarthanan, M.; Thomas, P. D.; Wu, C. H.; Yeats, C.; Yong, S. Y., InterPro in 2011: new developments in the family and domain prediction database. Nucleic Acids Res 2012, 40 (Database issue), D306-12.

17. Rost, B., Twilight zone of protein sequence alignments. Protein Eng 1999, 12 (2), 85-94.

18. Hsiao, T.-L.; Revelles, O.; Chen, L.; Sauer, U.; Vitkup, D., Automatic policing of biochemical annotations using genomic correlations. Nat Chem Biol 2010, 6 (1), 34-40.

19. Murzin, A. G.; Brenner, S. E.; Hubbard, T.; Chothia, C., SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 1995, 247 (4), 536-40.

20. Orengo, C. A.; Michie, A. D.; Jones, S.; Jones, D. T.; Swindells, M. B.; Thornton, J. M., CATH--a hierarchic classification of protein domain structures. Structure 1997, 5 (8), 1093-108.

21. Hawkins, T.; Kihara, D., Function prediction of uncharacterized proteins. J Bioinform Comput Biol 2007, 5 (1), 1-30.

22. Sael, L.; Chitale, M.; Kihara, D., Structure- and sequence-based function prediction for non-homologous proteins. J Struct Funct Genomics 2012, 13 (2), 111-23.

23. Shindyalov, I. N.; Bourne, P. E., Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng 1998, 11 (9), 739-47.

24. Holm, L.; Rosenstrom, P., Dali server: conservation mapping in 3D. Nucleic Acids Res 2010, 38 (Web Server issue), W545-9.

25. Ilyin, V. A.; Abyzov, A.; Leslin, C. M., Structural alignment of proteins by a novel TOPOFIT method, as a superimposition of common volumes at a topomax point. Protein Sci 2004, 13 (7), 1865-74.

26. Thompson, K. E.; Wang, Y.; Madej, T.; Bryant, S. H., Improving protein structure similarity searches using domain boundaries based on conserved sequence information. BMC structural biology 2009, 9, 33.

27. Laskowski, R. A.; Watson, J. D.; Thornton, J. M., ProFunc: a server for predicting protein function from 3D structure. Nucleic Acids Res 2005, 33 (Web Server issue), W89-93.

28. Kinoshita, K.; Nakamura, H., Identification of the ligand binding sites on the molecular surface of proteins. Protein Sci 2005, 14 (3), 711-8.

29. Redfern, O. C.; Dessailly, B. H.; Dallman, T. J.; Sillitoe, I.; Orengo, C. A., FLORA: a novel method to predict protein function from structure in diverse superfamilies. PLoS Comput Biol 2009, 5 (8), e1000485.

30. Ferre, F.; Ausiello, G.; Zanzoni, A.; Helmer-Citterich, M., SURFACE: a database of protein surface regions for functional annotation. Nucleic Acids Res 2004, 32 (Database issue), D240-4.

31. Binkowski, T. A.; Naghibzadeh, S.; Liang, J., CASTp: Computed Atlas of Surface Topography of proteins. Nucleic Acids Res 2003, 31 (13), 3352-5.

32. Laurie, A. T.; Jackson, R. M., Q-SiteFinder: an energy-based method for the prediction of protein-ligand binding sites. Bioinformatics 2005, 21 (9), 1908-16.

33. Porter, C. T.; Bartlett, G. J.; Thornton, J. M., The Catalytic Site Atlas: a resource of catalytic sites and residues identified in enzymes using structural data. Nucleic Acids Res 2004, 32 (Database issue), D129-33.

34. Hall, D. R.; Kozakov, D.; Vajda, S., Analysis of protein binding sites by computational solvent mapping. Methods Mol Biol 2012, 819, 13-27.

35. Parasuram, R.; Lee, J. S.; Yin, P.; Somarowthu, S.; Ondrechen, M. J., Functional classification of protein 3D structures from predicted local interaction sites. J Bioinform Comput Biol 2010, 8 Suppl 1, 1-15.

36. Gerlt, J. A.; Babbitt, P. C., Divergent evolution of enzymatic function: mechanistically diverse superfamilies and functionally distinct suprafamilies. Annu Rev Biochem 2001, 70, 209-46.

37. Allewell, N. M., Thematic Minireview Series on Enzyme Evolution in the Post-genomic Era. J Biol Chem 2012, 287 (1), 1-2.

38. Gandhimathi, A.; Nair, A. G.; Sowdhamini, R., PASS2 version 4: An update to the database of structure-based sequence alignments of structural domain superfamilies. Nucleic Acids Res 2012, 40 (Database issue), D531-4.

39. Krieger, E.; Joo, K.; Lee, J.; Lee, J.; Raman, S.; Thompson, J.; Tyka, M.; Baker, D.; Karplus, K., Improving physical realism, stereochemistry, and side-chain accuracy in homology modeling: Four approaches that performed well in CASP8. Proteins 2009, 77 Suppl 9, 114-22.

40. Ondrechen, M. J.; Clifton, J. G.; Ringe, D., THEMATICS: a simple computational predictor of enzyme function from structure. Proc Natl Acad Sci U S A 2001, 98 (22), 12473-8.

41. Capra, J. A.; Laskowski, R. A.; Thornton, J. M.; Singh, M.; Funkhouser, T. A., Predicting protein ligand binding sites by combining evolutionary sequence conservation and 3D structure. PLoS Comput Biol 2009, 5 (12), e1000585.

42. Sankararaman, S.; Sjolander, K., INTREPID--INformation-theoretic TREe traversal for Protein functional site IDentification. Bioinformatics 2008, 24 (21), 2445-52.

43. Krissinel, E.; Henrick, K., Secondary-structure matching (SSM), a new tool for fast protein structure alignment in three dimensions. Acta Crystallogr D Biol Crystallogr 2004, 60 (Pt 12 Pt 1), 2256-68.

44. Guda, C.; Lu, S.; Scheeff, E. D.; Bourne, P. E.; Shindyalov, I. N., CE-MC: a multiple protein structure alignment server. Nucleic Acids Res 2004, 32 (Web Server issue), W100-3.

45. Armougom, F.; Moretti, S.; Poirot, O.; Audic, S.; Dumas, P.; Schaeli, B.; Keduas, V.;

Notredame, C., Expresso: automatic incorporation of structural information in multiple sequence alignments using 3D-Coffee. Nucleic Acids Res 2006, 34 (Web Server issue), W604-8.

46. Leslin, C. M.; Abyzov, A.; Ilyin, V. A., TOPOFIT-DB, a database of protein structural alignments based on the TOPOFIT method. Nucleic Acids Res 2007, 35 (Database issue), D317-21.

47. Yin, P. Computational studies of enzyme function and dynamics. Northeastern University, Boston, MA, 2012.

48. Pettersen, E. F.; Goddard, T. D.; Huang, C. C.; Couch, G. S.; Greenblatt, D. M.; Meng, E. C.; Ferrin, T. E., UCSF Chimera--a visualization system for exploratory research and analysis. J Comput Chem 2004, 25 (13), 1605-12.

49. Krieger, E.; Koraimann, G.; Vriend, G., Increasing the precision of comparative models with YASARA NOVA--a self-parameterizing force field. Proteins 2002, 47 (3), 393-402.

50. Dayhoff, M. O.; Schwartz, R.; Orcutt, B. C., Atlas of Protein Sequence and Structure. National Biomedical Research Foundation: Washington, 1978; Vol. 3.

51. Henikoff, S.; Henikoff, J. G., Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci U S A 1992, 89 (22), 10915-9.

52. Gross, J. L.; Yellen, J., Graph Theory and Its Applications, Second Edition (Discrete Mathematics and Its Applications). Chapman & Hall/CRC: 2005.

53. Diestel, R., Graph Theory. Springer: 2010.

54. Barber, C. B.; Dobkin, D. P.; Huhdanpaa, H., The Quickhull algorithm for convex hulls. ACM T Math Software 1996, 22 (4), 469-483.

Chapter 4

Applying SALSA and SALSA-DT to the Ribulose Phosphate Binding Barrel Superfamily (RPBB)

4.1 Introduction of the Ribulose-phosphate binding barrel superfamily

The Ribulose-Phosphate Binding Barrel (RPBB) superfamily contains enzymes that are important in several types of metabolism. The Structural Classification of Proteins (SCOP) database1 assigns the RPBB superfamily the identification number 51366 and describes its structural features as a parallel (β/α)-barrel fold consisting of an eight-stranded parallel β-barrel surrounded by eight α-helices and containing a phosphate binding site (Figure 4.1).2

Figure 4.1 Cartoon sketch of the overall shape of members of the RPBB superfamily. A) The enzymes in this group are active in dimer form, with the active site at the dimer interface. The active site is located near the 7th and 8th β-strands. Residues from each monomer contribute to catalysis. B) The monomer form shows the signature eight β-strands. C) A 2D cartoon of the eight β-strands, extended, with colors corresponding to those of the 3D fold.2

The conserved residues and the specific phosphate-binding site for this superfamily are located in two glycine-rich regions near the ends of the 7th and 8th β-strands. The members of this superfamily catalyze diverse reactions with various substrates, which result in different products (Figure 4.2). The subclasses within the RPBB superfamily have been shown to be evolutionarily related, having a common ancestral precursor.2

4.1.1 Understanding the classification of subclasses

Using the SCOP1, PASS23 and SFLD4 databases, the superfamily was divided into eight subclasses (Table 4.1). These subclasses have different biological functions, substrates and reaction mechanisms but a similar overall shape. The enzymes in this superfamily are biologically significant because they are essential in various metabolic pathways. The proteins found in each subclass are listed in Table 4.2; the biochemical function of each protein has been experimentally verified.

Table 4.1 Overview of the enzyme functional types found in the RPBB superfamily.

Subclass | Abbreviation | Biological significance
Indole-3-glycerol phosphate synthase | IGPS | Biosynthesis of tryptophan
Tryptophan synthase | TrpA | Biosynthesis of tryptophan
Phosphoribosyl anthranilate isomerases | PRAI | Biosynthesis of tryptophan
Phosphoribosylformimino-5-aminoimidazole carboxamide ribotide isomerase/imidazoleglycerolphosphate synthase | HisA/HisF | Biosynthesis of histidine
Ribulose-phosphate 3-epimerase | RPE | Pentose phosphate cycle (nonoxidative), product has a role in gene expression
Orotidine 5′-monophosphate decarboxylase | OMPDC | Pyrimidine metabolism
Keto-3-gulonate-6-phosphate decarboxylase | KGPDC | Pentose and glucuronate interconversions
Hexulose phosphate synthase | HPS | Ribulose monophosphate (RuMP) pathway

Table 4.2 List of protein structures available in the PDB with functional information. In the subclass HPS, the P42405 structure is a homology model from the UniProt P42405 sequence.

Subclass | PDB | Species | Resolution | Date
IGPS | 1pii:A | Escherichia coli | 2.00 Å | 1992
IGPS | 1i4n | Thermotoga maritima | 2.50 Å | 2002
IGPS | 2c3z | Sulfolobus solfataricus | 2.80 Å | 2005
TrpA | 1geq | Pyrococcus furiosus | 2.00 Å | 2001
TrpA | 1qop | Salmonella enterica subsp. enterica serovar Typhimurium | 1.40 Å | 1999
TrpA | 1xc4 | Escherichia coli | 2.80 Å | 2004
TrpA | 1rd5 | Zea mays | 2.02 Å | 2005
PRAI | 1pii:B | Escherichia coli | 2.00 Å | 1992
PRAI | 1lbm | Thermotoga maritima | 2.80 Å | 2002
HisA/HisF | 1qo2 | Thermotoga maritima | 1.85 Å | 2000
HisA/HisF | 1vzw | Streptomyces coelicolor | 1.80 Å | 2005
HisA/HisF | 2y85 | Mycobacterium tuberculosis H37Rv | 2.40 Å | 2011
HisA/HisF | 1thf | Thermotoga maritima | 1.45 Å | 2000
HisA/HisF | 1h5y | Pyrobaculum aerophilum | 2.00 Å | 2001
HisA/HisF | 1ox6 | Saccharomyces cerevisiae | 2.40 Å | 2003
RPE | 1rpx | Solanum tuberosum | 2.30 Å | 1999
RPE | 2fli | Streptococcus pyogenes | 1.80 Å | 2006
RPE | 1h1y | Oryza sativa | 1.87 Å | 2003
RPE | 1tqj | Synechocystis sp. | 1.60 Å | 2004
RPE | 3ovp | Homo sapiens | 1.70 Å | 2011
OMPDC | 1dbt | Bacillus subtilis | 2.40 Å | 2000
OMPDC | 1dv7 | Methanothermobacter thermautotrophicus | 1.80 Å | 2000
OMPDC | 1dqw | Saccharomyces cerevisiae | 2.10 Å | 2000
OMPDC | 1l2u | Escherichia coli | 2.50 Å | 2002
OMPDC | 2za1 | Plasmodium falciparum | 2.65 Å | 2008
OMPDC | 3qw3 | Leishmania infantum | 1.70 Å | 2011
OMPDC | 3l0k | Homo sapiens | 1.34 Å | 2009
KGPDC | 1xbv | Escherichia coli | 1.66 Å | 2005
KGPDC | 3exr | Streptococcus mutans | 1.70 Å | 2009
HPS | 3ajx | Mycobacterium gastri | 1.60 Å | 2010
HPS | P42405 | Bacillus subtilis (strain 168) | Homology model | 2010

Figure 4.2 Simple overview of various reactions found in the RPBB superfamily. Differences can be found in the starting reactants and end products. The common region of each small molecule substrate is the phosphate group that interacts with the phosphate binding region found in these enzymes specific to the RPBB superfamily.

Targeting the enzymes in the histidine biosynthetic pathway is a potential area of therapeutics because only bacteria, fungi, and plants possess this anabolic pathway. Currently, there are no drugs available that target these enzymes.5 Protein design and engineering is a young field in which successful changes for improved proteins rely on understanding evolutionary relationships through structure and function. The relationships between subclasses in the RPBB superfamily are of interest because the enzymes in the subclasses catalyze similar reactions and contain a common binding site for a phosphate group.

4.1.2 Background of structure and function of enzymes in the RPBB superfamily

In this section, the active site of each subclass is examined to understand evolutionary relationships and to develop a consensus signature for the SALSA method. The enzymes involved in the biosynthesis of the amino acids tryptophan and histidine catalyze successive reactions and are therefore discussed together. The other reactions revolve around the catabolism of important substrates for metabolic pathways.

4.1.2.1 Biosynthesis of tryptophan

Three enzymes from the RPBB superfamily act consecutively in the biosynthesis of tryptophan (Figure 4.3).6 The 4th step involves phosphoribosyl anthranilate isomerase (PRAI), which converts N-(5’-phosphoribosyl) anthranilate (PRA) to 1’-(2’-carboxyphenylamino)-1’-deoxyribulose-5’-phosphate (CdRP).7 In the 5th step, indole-3-glycerol phosphate synthase (IGPS) catalyzes the ring closure of CdRP to indole-3-glycerol phosphate (IGP).8 Tryptophan synthase α-chain/α-subunit (TrpA) catalyzes the cleavage of IGP to indole and glyceraldehyde-3-phosphate (G3P).9 These three enzymes contain the common phosphate binding site, have similar substrates and share the (βα)8-barrel fold, which suggests that they evolved from a common ancestor.10

IGPS is active as a separate monomeric enzyme in most organisms but is found as a bifunctional enzyme with PRAI in Escherichia coli (E. coli).11 The bifunctional enzyme has an overall dumbbell shape and is large for a monomeric enzyme. The N-terminal domain catalyzes the IGPS reaction and the C-terminal domain catalyzes the PRAI reaction. The crystal structure from E. coli, PDB 1pii, contains both the IGPS and PRAI domains.7

Figure 4.3 Overall structural similarity between the IGPS, PRAI and TrpA subclasses. A) The bifunctional crystal structure (1pii) from the PDB shows the N-terminal domain of IGPS (in orange) and the C-terminal domain of PRAI (in green). B) Three crystal structures from E. coli, structurally aligned with the β-strands numbered; some helices overlap better than others.

The structure was divided into two domains: IGPS (residues M1-L254), referred to hereafter as 1pii:A, and PRAI (residues G255-Y447), referred to hereafter as 1pii:B. The structures of IGPS, PRAI and TrpA align closely (Figure 4.3). To compare the overall β/α-barrel structures, pairwise alignments of the main-chain atoms were computed using MUSTANG12 in YASARA13. The RMSD values for all superimposed main-chain atoms are 1.823 Å for the pair PRAI/IGPS (138 aligned residues) with 13.04% sequence identity, 2.032 Å for the pair IGPS/TrpA(a) (146 aligned residues) with 17.12% sequence identity, and 1.988 Å for the pair PRAI/TrpA(a) (119 aligned residues) with 18.49% sequence identity. The central barrel, composed of eight β-strands, aligns well; the main variability occurs in the helices of IGPS and TrpA. A few structural differences can be seen among the three enzymes. IGPS contains an N-terminal α-helix extension (residues M1-Q23) that is essential for substrate binding.14 TrpA contains an N-terminal α-helix (residues M1-K15) that is 14 residues longer than in the PRAI enzyme. The known active site residues from each subclass, collected from the literature and the Catalytic Site Atlas (CSA), are listed in Table 4.3.
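As a rough illustration of the two quantities reported above, the following sketch computes an RMSD over atom pairs that are already superimposed and a percent identity over aligned residue pairs. It is not the MUSTANG or YASARA implementation; the function names are hypothetical, and percent identity is assumed here to be the number of identical pairs divided by the number of aligned pairs. Under that assumption, the reported identities are consistent with 18 of 138, 25 of 146 and 22 of 119 identical aligned residues; these counts are back-calculated and are not stated in the text.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD for two equal-length lists of (x, y, z) tuples.

    The coordinates are assumed to be superimposed already; no fitting
    is performed here.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


def percent_identity(identical_pairs, aligned_pairs):
    """Percent identity, assuming identical pairs / aligned pairs."""
    return 100.0 * identical_pairs / aligned_pairs


# Back-calculated (assumed) identical-pair counts for the three alignments.
assumed_counts = {
    "PRAI/IGPS": (18, 138),
    "IGPS/TrpA": (25, 146),
    "PRAI/TrpA": (22, 119),
}

for pair, (identical, aligned) in assumed_counts.items():
    print(pair, round(percent_identity(identical, aligned), 2))
```

The RMSD function takes plain coordinate tuples so that it can be checked by hand on a toy pair of structures before being applied to real main-chain atoms.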

In the phosphoribosyl anthranilate isomerase (PRAI) subclass, there are two reported proteins with a solved crystal structure and experimental information available, from E. coli (PDB: 1pii:B) and Thermotoga maritima (PDB: 1lbm). The CSA reports two catalytic residues: a cysteine, which acts as the general base, and an aspartate, which acts as the general acid (Table 4.3). The reaction mechanism of PRAI (Figure 4.4) shows the cysteine and aspartate participating in an Amadori rearrangement of the substrate.

For the indole-3-glycerol phosphate synthase (IGPS) subclass, there are three reported proteins with a solved crystal structure and experimental information available, from E. coli (PDB: 1pii:A), Thermotoga maritima (PDB: 1i4n) and Sulfolobus solfataricus (PDB: 2c3z). The CSA reports six catalytic residues (Table 4.3). Lysine** acts as the general acid and glutamate** acts as the general base for catalytic activity. Lysine* and glutamate* create salt bridges with the anthranilate carboxyl group and hydroxyl group of the substrate. The asparagine and serine form hydrogen bonds to glutamate** to position the molecule in the active site (Figure 4.5).

Table 4.3 The reported catalytic residues from the CSA for the subclasses PRAI, IGPS and TrpA. Symbols are used to distinguish residues of the same amino acid type. For example, in IGPS there are two different groups of glutamates, E53/E47/E51 and E163/E157/E159; E53/E47/E51 is denoted glutamate* and E163/E157/E159 is denoted glutamate**. The same applies to lysine in IGPS.

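PRAI | Cys | Asp
1pii:B | C260 | D379
1lbm | C7 | D126

IGPS | Glu* | Lys* | Lys** | Glu** | Asn | Ser
1pii:A | E53 | K55 | K114 | E163 | N184 | S215
1i4n | E47 | K49 | K108 | E157 | N179 | S210
2c3z | E51 | K53 | K110 | E159 | N180 | S211

TrpA | Glu | Asp
1qop | E49 | D60
1geq | E36 | D47
1xc4 | E49 | D60
1rd5 | E50 | D61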

Figure 4.4 The general mechanism for phosphoribosyl anthranilate isomerase (PRAI). PRA undergoes an Amadori rearrangement to form 1’-(2’-carboxyphenylamino)-1’-deoxyribulose-5’-phosphate (CdRP). The furanose ring oxygen of the aminoaldose is protonated by a general acid (HA), an aspartate, which cleaves the carbon-oxygen bond to open the ring, forming a Schiff base “imine” intermediate. The resulting imine is deprotonated by a general base (B−), a cysteine, at the C2 atom of its ribose moiety to form an enolamine. The enolamine then tautomerizes to the keto form of CdRP.15 The conversion of the enolamine form of CdRP to the keto form occurs spontaneously and is independent of the enzyme concentration.16

Figure 4.5 The general mechanism for indole glycerol phosphate synthase (IGPS). The IGPS reaction proceeds through condensation, decarboxylation and dehydration steps.17 A lysine (general acid), through a salt bridge with the anthranilate carboxyl group and hydroxyl group, moves the carbonyl group of CdRP close to a proton-donating group. The lysine activates the carbonyl group to promote carbon-carbon bond formation through electrophilic attack on the pi-electrons of the benzene ring. A negatively charged proton-accepting group of a glutamate (general base) forms a stabilizing hydrogen bond to the nitrogen atom to form Intermediate 1. A decarboxylation occurs with the movement of electrons to form Intermediate 2. The dehydration of Intermediate 2 occurs through the lysine, which was restored to an acid via the salt bridge cluster of lysines in the active site. A proton is donated to the hydroxyl group and a water molecule is eliminated. The base accepts a proton from Intermediate 2 to generate the pyrrole ring and the end product, IGP. The general acid in this mechanism is a lysine and the general base is a glutamate.18

Tryptophan synthase (TrpA) catalyzes the “” steps in the synthesis of tryptophan. The enzyme contains two catalytically active subunits, alpha and beta. The alpha subunit catalyzes the cleavage of indolyl-glycerol-3-phosphate (IGP) to give indole and 3-phosphoglyceraldehyde (G3P). In the TrpA subclass, there are four reported proteins with solved crystal structures and experimental functional information available: these are from Pyrococcus furiosus (PDB: 1geq), Salmonella enterica subsp. enterica serovar Typhimurium (PDB: 1qop), Escherichia coli (PDB: 1xc4) and Zea mays (PDB: 1rd5). The CSA reports two catalytic residues (Table 4.3). The aspartate acts as a general base to deprotonate the nitrogen atom of the indole ring. The glutamate protonates the hydroxyl group of the glyceraldehyde moiety to form the aldehyde product and indole moiety (Figure 4.6).

These three enzymes are essential in the biosynthesis of tryptophan, and each contains the common phosphate binding site along the 7th and 8th β-strands. Each substrate contains a phosphate moiety that serves as an anchor in the binding site in preparation for catalysis. The CSA yields only a handful of active site residues for these groups, but the POOL method is expected to predict additional active site residues that can better sort these subclasses.

4.1.2.2 Biosynthesis of histidine

The phosphoribosylformimino-5-aminoimidazole carboxamide ribotide isomerase (HisA) and imidazoleglycerolphosphate synthase (HisF) catalyze successive (3rd and 4th) steps in the biosynthesis of histidine. HisA and HisF are β/α-barrel enzymes with a two-fold repeat structure; each half-barrel of these two enzymes contains a phosphate-binding motif for the biphosphate substrates, ProFAR and PRFAR.19

Figure 4.6 The general mechanism for tryptophan synthase (TrpA). The TrpA mechanism is an aldol cleavage-like reaction. The glutamate acts as both a proton donor and a proton acceptor. Tautomerization of the indole ring of IGP is facilitated by the carboxylate group of an aspartate (B2). Deprotonation of the hydroxyl group of the intermediate by a glutamate (B3) results in the cleavage of the bond between indole and G3P and the formation of G3P.9

HisA converts N’-[(5’-phosphoribosyl)-formimino]-5-aminoimidazole-4-carboxamide ribonucleotide (ProFAR) into N-(5’-phospho-D-1’-ribulosylformimino)-5-amino-1-(5”-phosphoribosyl)-4-imidazolecarboxamide (PRFAR) through an Amadori rearrangement, similar to that of subclass PRAI (Figure 4.7).15 PRFAR then binds to the active site of imidazole glycerol phosphate synthase (ImGPS). ImGPS converts PRFAR to imidazole glycerol phosphate (ImGP) and 5-aminoimidazole-4-carboxamide ribonucleotide (AICAR). ImGP continues in the histidine biosynthesis pathway and AICAR is used in the de novo synthesis of purines.20 ImGPS catalyzes two coupled reactions, the glutaminase reaction (Eq. 4.1) and the synthase reaction (Eq. 4.2).21

Glutamine + H2O → glutamic acid + NH3 (Eq. 4.1)

PRFAR + NH3 → ImGP + AICAR (Eq. 4.2)

In yeast, ImGPS contains both catalytic activities, as they are fused on one polypeptide. In bacteria, ImGPS forms a bifunctional enzyme consisting of a glutaminase subunit (HisH) and a cyclase/synthase subunit (HisF) (see Figure 4.7). The binding of PRFAR stimulates the glutaminase activity of HisH.22 The cyclase/synthase HisF reaction is ammonia-dependent. It is proposed that ammonia is transported between the active sites of HisH and HisF through a channel.5 Structurally, one conserved phosphate binding region is found on the 7th and 8th β-strands, and the second is found on the 3rd β-strand.19
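The coupling of Eq. 4.1 and Eq. 4.2 can be made explicit by summing the two steps and cancelling any species that appears on both sides. The sketch below is illustrative only; the helper name `net_reaction` is hypothetical, and the species labels are taken directly from the two equations. It shows that ammonia is produced by the glutaminase step and consumed by the synthase step, so it does not appear in the net reaction.

```python
from collections import Counter


def net_reaction(steps):
    """Sum (reactants, products) steps and cancel species found on both sides."""
    left, right = Counter(), Counter()
    for reactants, products in steps:
        left.update(reactants)
        right.update(products)
    # Counter subtraction keeps only species with a positive remaining count.
    return left - right, right - left


glutaminase = (["glutamine", "H2O"], ["glutamic acid", "NH3"])  # Eq. 4.1
synthase = (["PRFAR", "NH3"], ["ImGP", "AICAR"])                # Eq. 4.2

reactants, products = net_reaction([glutaminase, synthase])
print(" + ".join(sorted(reactants)), "->", " + ".join(sorted(products)))
```

Because the intermediate cancels in the sum, the bookkeeping is consistent with the proposal that ammonia passes directly from HisH to HisF rather than being released.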

The general mechanism for HisA/HisF is a two-step process. First, the furanose ring oxygen atom of ProFAR is protonated by a general acid (HA), which cleaves the carbon-oxygen bond to open the ring, forming a Schiff base “imine” intermediate (Figure 4.8). The resulting imine is deprotonated by a general base (B−) at the C atom of its ribose moiety to form an enolamine. The tautomer changes into the keto form, PRFAR. Second, HisF uses ammonia and PRFAR to catalyze the closure of the imidazole ring to generate imidazole glycerol phosphate (ImGP) and 5-aminoimidazole-4-carboxamide ribonucleotide (AICAR). Ammonia from HisH (Figure 4.9) attacks the carbonyl group next to the phosphoglycerol, with subsequent imine formation. Water hydrolyzes the imine group, separating the substrate into a formamide, Imine II, and AICAR. Two aspartates are used for the cyclization steps.

Figure 4.7 Overall structure difference found between HisF in yeast and bacteria. The yeast structure, PDB 1ox6 contains HisH in light blue and HisF in dark blue in one polypeptide with two phosphate ions in orange. The two regions have separate folds. The bacterial HisF structure, PDB 1thf from Thermotoga maritima, is structurally similar to the yeast HisF.

Figure 4.8 Part I of the general proposed mechanism for HisA/HisF. HisA catalyzes an ammonium-dependent reaction in which ProFAR is converted to PRFAR via an Amadori rearrangement. The furanose oxygen of ProFAR is protonated by the general acid (HA), an aspartate on β-strand 1. The carbon-oxygen bond is cleaved to open the ring, forming a Schiff base “imine” intermediate. The imine is deprotonated by a general base (B−), an aspartate on β-strand 5, at the C atom of its ribose moiety to form an enolamine. The tautomer changes into the keto form, PRFAR.23

Figure 4.9 Part II of the general proposed mechanism for HisA/HisF. HisF uses ammonia and PRFAR to catalyze the closure of the imidazole ring, generating AICAR and ImGP. A nucleophilic ammonia attacks the phosphoribosyl carbonyl oxygen of PRFAR, resulting in the release of water and the formation of Imine I. Water hydrolyzes the imine, separating it into Imine II and AICAR. For the ring closure, one aspartate (B) deprotonates the –CH2 group of Imine II, which forms a cyclic intermediate. The other aspartate (A) deprotonates the -NH group to form the imine bond of the imidazole product.24

In the HisA/HisF subclass, there are six reported proteins with solved crystal structures and experimental information available, from Thermotoga maritima (PDB: 1qo2), Streptomyces coelicolor (PDB: 1vzw), Mycobacterium tuberculosis H37Rv (PDB: 2y85), Thermotoga maritima (PDB: 1thf), Pyrobaculum aerophilum (PDB: 1h5y) and Saccharomyces cerevisiae (PDB: 1ox6). The structures PDB: 1qo2, 1vzw and 2y85 belong to the HisA branch of the subclass, and PDB: 1thf, 1h5y and 1ox6 belong to the HisF branch. The two conserved aspartate active site residues, which carry out general acid/base reactions, are in structurally identical positions in the HisA and HisF enzymes. The CSA reports two catalytic residues (Table 4.4) for this subclass. In HisA, aspartate** acts as the general acid, which protonates the substrate to yield a Schiff base, and aspartate* acts as the general base, which abstracts a proton from the C2 of the intermediate and transfers it to the C1 position, yielding PRFAR.23 In HisF, aspartate* acts as a general acid to deprotonate the –CH2 group of imine II, which forms a cyclic intermediate, while aspartate** acts as a general base that deprotonates the -NH group to form the imine bond of the imidazole product.24

Table 4.4 The reported catalytic residues from the CSA for the subclass HisA/HisF. Symbols are used to distinguish residues of the same amino acid type. There are two different groups of aspartates, distinguished by their position in the sequence. The first aspartate is denoted aspartate* and the second aspartate**.

HisA/HisF | Asp* | Asp**
1qo2 | D8 | D127
1vzw | D11 | D130
2y85 | D11 | D130
1thf | D11 | D130
1h5y | D12 | D133
1ox6 | D245 | D404

4.1.2.3 Essential metabolic pathway enzymes

The next group of evolutionarily similar proteins includes ribulose-phosphate 3-epimerase (RPE), orotidine 5′-monophosphate decarboxylase (OMPDC), keto-3-gulonate-6-phosphate decarboxylase (KGPDC) and hexulose phosphate synthase (HPS). These enzymes are essential in various metabolic pathways for the synthesis of substrates.

The ribulose-phosphate 3-epimerase (RPE) enzyme is essential in the pentose phosphate cycle, in which RPE interconverts ribulose-5-phosphate (R5P) and D-xylulose-5-phosphate (X5P).25 The epimerization is achieved by general acid/base catalysis. In the RPE subclass, there are five reported proteins with solved crystal structures and experimental information available, from Solanum tuberosum (PDB: 1rpx), Streptococcus pyogenes (PDB: 2fli), Oryza sativa (PDB: 1h1y), Synechocystis sp. (PDB: 1tqj) and Homo sapiens (PDB: 3ovp). The CSA reports four catalytic residues (Table 4.5), and a Zn2+ ion is required for catalysis to stabilize the 2,3-enediolate intermediate that is formed. For catalysis, the Zn2+ ion is coordinated by hydrogen-bonding interactions to histidine† and histidine††. Aspartate◊ abstracts a proton from C3 of the R5P. Protonation of the 2,3-enediolate intermediate by aspartate◊◊ yields X5P.

The orotidine 5′-monophosphate decarboxylase (OMPDC) subclass is an essential and proficient enzyme (the enzyme-catalyzed rate is 10^17-fold faster than that of the non-enzymatic reaction) for the biosynthesis of pyrimidine nucleotides.26 OMPDC catalyzes the metal ion-independent decarboxylation of orotidine monophosphate (OMP) to form uridine monophosphate (UMP) and CO2. In the proposed reaction mechanism, OMPDC catalysis proceeds through a vinyl carbanion reaction intermediate (Figure 4.11).27 The enzymes are biologically active in dimer form with the active site located at the dimer interface.28 In the OMPDC subclass, there are seven reported proteins with solved crystal structures and experimental information available, from Bacillus subtilis (PDB: 1dbt), Methanothermobacter thermautotrophicus (PDB: 1dv7), Saccharomyces cerevisiae (PDB: 1dqw), Escherichia coli (PDB: 1l2u), Plasmodium falciparum (PDB: 2za1), Leishmania infantum (PDB: 3qw3) and Homo sapiens (PDB: 3l0k). The CSA reports three catalytic residues (Table 4.5), which are in a conserved Asp-x-Lys-x-x-Asp motif. The first aspartate and lysine residues come from one polypeptide subunit and the second aspartate comes from the neighboring polypeptide subunit in the dimer. The negative charge of the carboxylate group of aspartate* causes electrostatic/steric destabilization of the orotate ring to initiate decarboxylation.28 CO2 is released, creating the C6 carbanion intermediate. Lysine acts as a general acid to protonate the C6 carbon to form the UMP product.29 Aspartate** maintains hydrogen bonds to the hydroxyl group of the ribosyl moiety and to the lysine residue.30,31
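To give a sense of scale for a 10^17-fold rate enhancement, the sketch below converts a fold enhancement into an approximate transition-state stabilization energy using the relation ΔΔG‡ = RT ln(fold). This is a back-of-envelope estimate, not a result from this work: it assumes a transition-state-theory picture, directly comparable rate constants for the catalyzed and uncatalyzed reactions, and a temperature of 298.15 K, none of which is stated in the text. The function name is hypothetical.

```python
import math

R_KCAL_PER_MOL_K = 1.987204e-3  # gas constant in kcal/(mol*K)


def stabilization_energy_kcal(fold_enhancement, temperature_k=298.15):
    """Approximate ddG (kcal/mol) for a given rate enhancement: RT * ln(fold).

    temperature_k defaults to 298.15 K, an assumed value.
    """
    if fold_enhancement <= 0:
        raise ValueError("fold_enhancement must be positive")
    return R_KCAL_PER_MOL_K * temperature_k * math.log(fold_enhancement)


estimate = stabilization_energy_kcal(1e17)
print(f"{estimate:.1f} kcal/mol")
```

Under these assumptions the estimate comes to roughly 23 kcal/mol, which illustrates why OMPDC is described as a proficient enzyme.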

Figure 4.10 The general mechanism for ribulose-phosphate 3-epimerase (RPE). An octahedrally coordinated Zn2+ ion is located in the active site and coordinates with the nitrogen atoms of two histidine residues, the carboxylate oxygen atoms of both catalytic aspartates, and the O2 and O3 of the D-xylitol-5-phosphate. Proton abstraction from C3 of the ribulose phosphate by an aspartate generates an anionic 2,3-enediolate intermediate. The Zn2+ stabilizes the anionic oxygen of C2. Protonation of the intermediate by an aspartate yields xylulose-5-phosphate.32

Table 4.5 The reported catalytic residues from the CSA for the subclasses OMPDC, RPE, KGPDC and HPS. Symbols are used to distinguish residues of the same amino acid type. For example, in KGPDC there are three different groups of aspartates, distinguished by position in the sequence, D11/D13, D62/D64 and D67/D69, which are denoted aspartate●, aspartate●● and aspartate●●●, respectively.

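OMPDC | Asp* | Lys | Asp**
1dbt | D60 | K62 | D65
1dv7 | D70 | K72 | D75
1dqw | D91 | K93 | D96
1l2u | D71 | K73 | D76
2za1 | D136 | K138 | D141
3qw3 | D82 | K84 | D87
3l0k | D88 | K90 | D93

RPE | His† | Asp◊ | His†† | Asp◊◊
1rpx | H41 | D43 | H74 | D185
2fli | H34 | D36 | H67 | D176
1h1y | H36 | D38 | H69 | D178
1tqj | H35 | D37 | H68 | D179
3ovp | H35 | D37 | H70 | D175

KGPDC | Asp● | Glu | Asp●● | Lys | Asp●●●
1xbv | D11 | E33 | D62 | K64 | D67
3exr | D13 | E35 | D64 | K66 | D69

HPS
3ajx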

Keto-3-gulonate-6-phosphate decarboxylase (KGPDC) and orotidine-5'-phosphate decarboxylase (OMPDC) are homologous enzymes that catalyze mechanistically distinct reactions using different substrates. KGPDC contains the conserved Asp-x-Lys-x-x-Asp motif seen in the OMPDC subclass. KGPDC catalyzes the metal ion-dependent decarboxylation of 3-keto-L-gulonate 6-phosphate (3KG6P). The Mg2+ ion stabilizes the 1,2-enediolate intermediate. In the KGPDC subclass, there are two reported proteins with solved crystal structures and experimental information available, from Escherichia coli (PDB: 1xbv) and Streptococcus mutans (PDB: 3exr). The CSA reports five catalytic residues (Table 4.5). The first aspartate● interacts with the glutamate and the second aspartate●●. The glutamate and aspartate●● interact with the Mg2+ ion. The magnesium ion coordinates to the oxygen atom of the keto-group at the C3 position and the oxygen atom of the hydroxyl-group at the C4 position of the substrate.33 Once the intermediate is stabilized, the lysine and aspartate●●● hydrogen bond to the oxygen atom of C2 to stabilize and position the intermediate for proper addition of the proton, via a water molecule.34

The KGPDC results from the CSA server do not include the histidine, which plays a role in protonation of the water molecule and acts as a proton shuttle to protonate the enediolate.35 The complete reaction mechanism is not fully understood; additional crystal structures may help to clarify it.

Figure 4.11 The general mechanism for orotidine-5’-monophosphate decarboxylase (OMPDC). Decarboxylation of OMP proceeds by a stepwise mechanism through a UMP vinyl carbanion intermediate. In the proposed mechanism, the negative charge of the carboxylate group of aspartate causes electrostatic/steric destabilization of the orotate ring to initiate decarboxylation.28 The loss of CO2 forms the vinyl anion intermediate, and the carbanion at C6 is protonated by the lysine, acting as a general acid, to form UMP.27 The lysine residue is then reprotonated by solvent to regenerate the active site.29

Figure 4.12 The general mechanism for keto-3-gulonate-6-phosphate decarboxylase (KGPDC). Upon binding, the substrate decarboxylates to form the 1,2-cis-enediolate intermediate and carbon dioxide. The magnesium ion coordinates to the oxygen atom of the keto-group at the C3 position and the oxygen atom of the hydroxyl-group at the C4 position of the substrate.33 A glutamate and an aspartate interact with the Mg2+ ion. Once the intermediate is stabilized, the lysine and another aspartate hydrogen bond to the oxygen atom of C2 to stabilize and position the intermediate for proper addition of the proton, via a water molecule and histidine.34

Hexulose phosphate synthase (HPS) is a key enzyme in the ribulose monophosphate (RuMP) pathway, which is involved in formaldehyde fixation in methylotrophic bacteria that can utilize methanol as a sole carbon and energy source.36 HPS catalyzes the reversible Mg2+-dependent aldol condensation of D-ribulose 5-phosphate (R5P) and formaldehyde to form D-arabino-3-hexulose 6-phosphate (A3H6P).37 HPS and KGPDC share several conserved residues, are metal-ion dependent and are capable of catalyzing each other’s reaction.38 The Mg2+ binding site is conserved between HPS and KGPDC. HPS contains the D-x-K-x-x-D motif at the end of the 3rd β-strand. The Mg2+ ion is coordinated to the two oxygen atoms of glutamate and aspartate. The Mg2+ ion will ligate to C2 and C3 of the intermediate; this is different from KGPDC, in which the ligation occurs at C3 and C4. Proton abstraction from the 1-hydroxymethylene group of R5P to generate the enediolate intermediate occurs by the catalytic base, histidine.37 There is one reported crystal structure with experimental information available, from Mycobacterium gastri MB19 (PDB: 3ajx). The CSA does not have any reported catalytic residues for this subclass (Table 4.5). In HPS, the Mg2+ ion interacts with the substrate and the enediolate intermediate is formed. A glutamate and an aspartate are ligated to the Mg2+ ion. The magnesium ion coordinates to the oxygen atom of the keto-group at the C2 position and the oxygen atom of the hydroxyl-group at the C4 position of the substrate.37 A histidine acts as the catalytic base to abstract a proton from the intermediate to produce D-arabino-3-hexulose-6-phosphate (A3H6P).

The overall 3D structure of these proteins is very similar, but the reactions they catalyze are distinctly different. Members of the subclasses were selected based on their low sequence identity to one another and on the availability of PDB structures and of verified biochemical function. A list of structural genomics proteins was generated from the PDB based on the designation that they were solved at one of the structural genomics centers.

Figure 4.13 The general mechanism for hexulose phosphate synthase (HPS). The initial step of aldol condensation requires proton abstraction from the 1-hydroxymethylene group of R5P by histidine to generate an enediolate intermediate. The currently proposed mechanism of HPS is a transfer of a proton and formaldehyde via water molecule to the C1 of the enediolate intermediate to synthesize D-arabino-3-hexulose-6-phosphate (A3H6P). There is no bound HPS-substrate complex to clarify addition of formaldehyde.37

4.1.3 Sequence identity between subclasses within the RPBB superfamily

To demonstrate that this method is effective in sorting the subclasses using a diverse set of proteins, structures were chosen such that the sequence identity between any two members of the set was as low as possible. A sequence identity of 40% was set as the cut-off between each pair of proteins in a subclass (Table 4.6). For some subclasses, the number of structures was limited, so higher sequence identities were allowed between diverse species. For example, in the RPE subclass, the structure PDB: 1rpx is from a flowering plant (Solanum tuberosum) and has a sequence identity of 68% to structure PDB: 1tqj from a Synechocystis species, a genus of cyanobacteria. These two species share phototrophic capability: the cyanobacterium is itself phototrophic, and the plant is photosynthetic through its chloroplasts. Thus, in spite of the high sequence identity, proteins from these two diverse species were allowed in the dataset.
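The pairwise cut-off described above can be checked mechanically. The following is a minimal sketch, not the procedure used in this work: it takes the RPE block of the identity matrix in Table 4.6 and lists the pairs that exceed the 40% cut-off. The function name and the dictionary layout are illustrative assumptions.

```python
# Pairwise sequence identities (%) for the RPE subclass, copied from the
# upper triangle of the RPE block in Table 4.6.
RPE_IDENTITY = {
    ("1RPX", "2FLI"): 46.40,
    ("1RPX", "1H1Y"): 40.72,
    ("1RPX", "1TQJ"): 67.86,
    ("1RPX", "3OVP"): 38.07,
    ("2FLI", "1H1Y"): 47.91,
    ("2FLI", "1TQJ"): 45.00,
    ("2FLI", "3OVP"): 43.93,
    ("1H1Y", "1TQJ"): 41.26,
    ("1H1Y", "3OVP"): 52.25,
    ("1TQJ", "3OVP"): 37.50,
}


def pairs_above_cutoff(identity, cutoff=40.0):
    """Return (pair, identity) tuples above the cutoff, highest first."""
    flagged = [(pair, pct) for pair, pct in identity.items() if pct > cutoff]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


flagged = pairs_above_cutoff(RPE_IDENTITY)
for (a, b), pct in flagged:
    print(f"{a}-{b}: {pct:.2f}%")
```

For this subclass, eight of the ten pairs exceed the cut-off, with 1rpx and 1tqj the most similar pair, which illustrates why higher identities had to be tolerated where few structures were available.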

Superfamilies with well-studied proteins generally have more structures available. In the RPBB, a few subclasses contained more information than others, so the data set is smaller for the less-studied subclasses. There was at least one crystal structure for each subclass. In addition to selection based on low mutual sequence identity, structures with higher resolution were also chosen. A structure with a bound ligand helps in gathering information about the interactions between the known ligand and the catalytic residues. An apo structure provides a better structure match for the structural genomics proteins, as they typically are crystallized without a bound ligand; it is therefore easier to compare active sites. Because of the evolutionary relationships between subclasses and the similarity in 3D structure, sequence identity alone cannot reliably sort the superfamily. Structural information gives more detail at local sites of alignment. Identifying the local site makes it possible to determine more reliably into which functional subclass each structural genomics protein should be classified.

Table 4.6 Sequence identity matrix using the multiple sequence alignment program Clustal Omega.39 The black outlined regions highlight the proteins within a subclass. A sequence identity greater than 40% is colored orange and, as the percentage increases, it becomes red (>60%). Lower sequence identity is indicated by the colors yellow (>20%) and green (>10%).

1PII_B 1LBM 1PII_A 1I4N 2C3Z 1GEQ 1QOP 1XC4 1RD5 1QO2 1VZW 2Y85 1THF 1H5Y 1OX6 1RPX 2FLI 1H1Y 1TQJ 3OVP 1DBT 1DV7 1DQW 1L2U 2ZA1 3QW3 3L0K 1XBV 3EXR 3AJX P42405
1PII_B 100 31.61 14.65 16.98 11.11 14.62 13.74 14.5 15.38 16.54 13.67 17.14 12.5 13.7 14.92 12.4 10.7 6.56 14.29 7.14 6.98 9.76 6.87 8.73 11.92 10.08 10.48 9.76 6.5 12.61 8.33
1LBM 31.61 100 14.37 14.71 16.56 15.5 11.54 7.69 13.18 12.69 10.71 11.35 12.41 12.24 12.64 11.76 7.56 6.78 12.9 7.38 5.38 12.5 5.43 4.72 9.55 13.6 6.56 7.5 7.5 6.03 7.69
1PII_A 14.65 14.37 100 31.28 34.84 13.33 12.5 13.82 13.25 14.55 14.04 14.53 12.72 16.57 17.02 13.48 12.8 12.14 10.96 13.19 8.61 10 8.92 8.72 10.33 11.26 12.58 9.66 8.28 15.22 10.79
1I4N 16.98 14.71 31.28 100 33.94 16.23 16.03 16.03 13.46 13.17 14.45 14.37 14.61 17.22 15.09 11.27 12 14.18 16.33 12.93 13.16 21.28 7.98 10.53 9.09 10.53 13.55 13.7 14.38 13.67 16.43
2C3Z 11.11 16.56 34.84 33.94 100 14.39 14.18 14.89 16.43 19.35 14.38 12.96 16.05 15.85 17.24 18.18 14.4 17.05 17.04 18.05 11.35 14.73 6.85 10.71 10.4 10.56 11.43 14.07 14.81 7.63 12.88
1GEQ 14.62 15.5 13.33 16.23 14.39 100 32.26 33.47 39.09 11.25 16.87 17.79 12.72 16.18 11.33 14.72 17.4 12.69 20.2 16.26 15.27 15.98 11.17 11.88 11.56 9.85 10.66 10.42 9.84 15.22 15.14
1QOP 13.74 11.54 12.5 16.03 14.18 32.26 100 85.45 31.8 15.95 16.47 17.47 11.86 13.48 17.76 14.55 11.8 12.92 15.02 12.21 14.76 13.94 10.22 13.82 9.54 11.98 15.74 12.56 15.42 16.4 10.94
1XC4 14.5 7.69 13.82 16.03 14.89 33.47 85.45 100 30.65 17.18 17.06 17.47 12.43 13.48 15.89 13.62 13.3 13.88 15.49 13.15 14.29 14.42 10.22 14.75 9.54 10.6 15.28 11.56 14.43 16.4 13.54
1RD5 15.38 13.18 13.25 13.46 16.43 39.09 31.8 30.65 100 16.98 15.66 14.2 12.21 15.61 17.62 14.9 13.1 15.2 17.31 15.38 12.2 20.2 13.12 14.08 10 9.26 14.15 12.69 12.56 16.58 15.79
1QO2 16.54 12.69 14.55 13.17 19.35 11.25 15.95 17.18 16.98 100 26.96 25.65 20.09 24.34 18.88 12.2 6.17 11.18 10.91 7.5 13.25 10.69 9.2 9.15 10.5 13.75 10 10.56 13.04 9.03 14.1
1VZW 13.67 10.71 14.04 14.45 14.38 16.87 16.47 17.06 15.66 26.96 100 68.33 23.93 25 19.58 12.2 11.8 14.11 13.17 12.8 13.25 11.73 11.76 15.06 9.09 15.66 12.57 13.66 17.18 17.53 16.13
2Y85 17.14 11.35 14.53 14.37 12.96 17.79 17.47 17.47 14.2 25.65 68.33 100 23.18 28.09 21.25 13.5 11.8 13.12 14.02 14.91 16.87 13.12 11.45 16.97 11.48 15.95 10.43 15.53 18.52 16.88 14.19
1THF 12.5 12.41 12.72 14.61 16.05 12.72 11.86 12.43 12.21 20.09 23.93 23.18 100 54.4 45.06 14.46 11 15.15 14.71 12.43 13.1 12.12 11.43 11.9 11.29 13.58 14.71 11.04 14.55 14.74 12.1
1H5Y 13.7 12.24 16.57 17.22 15.85 16.18 13.48 13.48 15.61 24.34 25 28.09 54.4 100 43.08 19.16 13.5 16.97 15.29 12.43 13.69 14.46 10.23 8.88 12.7 14.72 12.28 14.72 10.91 19.23 14.65
1OX6 14.92 12.64 17.02 15.09 17.24 11.33 17.76 15.89 17.62 18.88 19.58 21.25 45.06 43.08 100 13.16 10.4 12.43 13.68 11.64 11.11 10.94 12.04 10.66 9.87 12.82 16.19 13.19 15.22 14.86 12.5
1RPX 12.4 11.76 13.48 11.27 18.18 14.72 14.55 13.62 14.9 12.2 12.2 13.5 14.46 19.16 13.16 100 46.4 40.72 67.86 38.07 16.83 16.27 11.79 16.51 15.79 13.4 14.08 14.29 17.17 17.55 17.28
2FLI 10.74 7.56 12.77 11.97 14.39 17.44 11.82 13.3 13.13 6.17 11.8 11.8 11.04 13.5 10.38 46.36 100 47.91 45 43.93 18.54 18.91 14.36 19.12 15.58 16.85 14.29 16.06 18.56 22.58 17.55
1H1Y 6.56 6.78 12.14 14.18 17.05 12.69 12.92 13.88 15.2 11.18 14.11 13.12 15.15 16.97 12.43 40.72 47.9 100 41.26 52.25 16.19 17.79 12.8 15.31 13.46 16.06 12.68 14 17.82 20.94 12.89
1TQJ 14.29 12.9 10.96 16.33 17.04 20.2 15.02 15.49 17.31 10.91 13.17 14.02 14.71 15.29 13.68 67.86 45 41.26 100 37.5 18.57 21.74 12.26 16.35 13.4 13.02 14.56 14.65 20.1 17.55 18.85
3OVP 7.14 7.38 13.19 12.93 18.05 16.26 12.21 13.15 15.38 7.5 12.8 14.91 12.43 12.43 11.64 38.07 43.9 52.25 37.5 100 16.83 16.59 10.9 16.02 12.08 13.23 12.32 12.56 19.5 18.52 13.54
1DBT 6.98 5.38 8.61 13.16 11.35 15.27 14.76 14.29 12.2 13.25 13.25 16.87 13.1 13.69 11.11 16.83 18.5 16.19 18.57 16.83 100 19.91 21.46 40.17 18.75 16.32 22.07 19.05 17.37 18.91 19.61
1DV7 9.76 12.5 10 21.28 14.73 15.98 13.94 14.42 20.2 10.69 11.73 13.12 12.12 14.46 10.94 16.27 18.9 17.79 21.74 16.59 19.91 100 17.29 21.76 23.53 23.56 21.74 25.49 23.79 28.57 28.14
1DQW 6.87 5.43 8.92 7.98 6.85 11.17 10.22 10.22 13.12 9.2 11.76 11.45 11.43 10.23 12.04 11.79 14.4 12.8 12.26 10.9 21.46 17.29 100 19.11 15.14 18.18 49.42 13.57 16.18 14.14 15.46
1L2U 8.73 4.72 8.72 10.53 10.71 11.88 13.82 14.75 14.08 9.15 15.06 16.97 11.9 8.88 10.66 16.51 19.1 15.31 16.35 16.02 40.17 21.76 19.11 100 15.57 13.4 18.64 17.39 20.95 17.09 16.34
2ZA1 11.92 9.55 10.33 9.09 10.4 11.56 9.54 9.54 10 10.5 9.09 11.48 11.29 12.7 9.87 15.79 15.6 13.46 13.4 12.08 18.75 23.53 15.14 15.57 100 32.55 16.43 16.67 17.41 16.75 17.62
3QW3 10.08 13.6 11.26 10.53 10.56 9.85 11.98 10.6 9.26 13.75 15.66 15.95 13.58 14.72 12.82 13.4 16.9 16.06 13.02 13.23 16.32 23.56 18.18 13.4 32.55 100 20.51 15.56 15.3 21.39 16.57
3L0K 10.48 6.56 12.58 13.55 11.43 10.66 15.74 15.28 14.15 10 12.57 10.43 14.71 12.28 16.19 14.08 14.3 12.68 14.56 12.32 22.07 21.74 49.42 18.64 16.43 20.51 100 13.99 16.16 17.65 18.52
1XBV 9.76 7.5 9.66 13.7 14.07 10.42 12.56 11.56 12.69 10.56 13.66 15.53 11.04 14.72 13.19 14.29 16.1 14 14.65 12.56 19.05 25.49 13.57 17.39 16.67 15.56 13.99 100 46.76 34.95 32.06
3EXR 6.5 7.5 8.28 14.38 14.81 9.84 15.42 14.43 12.56 13.04 17.18 18.52 14.55 10.91 15.22 17.17 18.6 17.82 20.1 19.5 17.37 23.79 16.18 20.95 17.41 15.3 16.16 46.76 100 29.13 27.75
3AJX 12.61 6.03 15.22 13.67 7.63 15.22 16.4 16.4 16.58 9.03 17.53 16.88 14.74 19.23 14.86 17.55 22.6 20.94 17.55 18.52 18.91 28.57 14.14 17.09 16.75 21.39 17.65 34.95 29.13 100 39.13
P42405 8.33 7.69 10.79 16.43 12.88 15.14 10.94 13.54 15.79 14.1 16.13 14.19 12.1 14.65 12.5 17.28 17.6 12.89 18.85 13.54 19.61 28.14 15.46 16.34 17.62 16.57 18.52 32.06 27.75 39.13 100

4.1.4 Identifying structural genomics (SG) proteins in the RPBB superfamily

What identifies a structural genomics protein? The criteria include keywords listed with PDB structures: putative, hypothetical, unknown, or structural genomics. The PDB page (Figure 4.14) needs to be searched for any experimental information about a structure solved at one of the structural genomics centers. In the RPBB superfamily, there were 28 structures with a keyword match (Table 4.7). In a DALI structure search, two structures were identified as similar to protein structures in the PDB. The two structures, PDB: 1y0e and 1yxy, have putative N-acetylmannosamine-6-phosphate 2-epimerase function. They were chosen to act as negative controls for our SALSA and SALSA-DT methods. The typical resolution of the X-ray crystal structures solved at structural genomics centers ranges between 1.2 and 2.2 Å. Most structures are apo structures, owing to the general crystallization methods used for these types of proteins and because the native substrate is generally not known. The PSI program launched in 2000; solved structures began to be deposited in the PDB as early as 2002, and deposition is continuing in 2014. Identifying these structures is a difficult task. A rigorous and repetitive structure similarity search, keyword search and filtering has to be performed to identify SG proteins from a certain superfamily. There is no announcement when new SG protein structures are released, and there is no standard release schedule, because the various centers deposit at different rates and with different research priorities. Once the list is compiled, POOL is run on the SG proteins. The SG proteins are structurally aligned to each of the subclasses through multiple structure alignments and scored using the BLOSUM62 matrix. The subclass with the highest number of positive-scoring matches against its characteristic set of catalytic and binding residues is analyzed further. The functional annotation is made after that process is complete.
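The scoring step at the end of this procedure can be illustrated with a small example. This is a hedged sketch rather than the SALSA implementation: the four-position signatures and the query string are hypothetical, the positional alignment is assumed to be given, and only the handful of BLOSUM62 entries needed for the example are included.

```python
# Excerpt of the BLOSUM62 matrix, limited to the residue pairs used below.
BLOSUM62_EXCERPT = {
    ("D", "D"): 6, ("E", "E"): 5, ("H", "H"): 8, ("K", "K"): 5,
    ("D", "E"): 2, ("E", "K"): 1, ("D", "H"): -1, ("D", "K"): -1,
}


def blosum62(a, b):
    """Symmetric lookup; raises KeyError for pairs outside the excerpt."""
    if (a, b) in BLOSUM62_EXCERPT:
        return BLOSUM62_EXCERPT[(a, b)]
    return BLOSUM62_EXCERPT[(b, a)]


def positive_matches(query, signature):
    """Count aligned positions whose BLOSUM62 score is positive."""
    return sum(1 for q, s in zip(query, signature) if blosum62(q, s) > 0)


# Hypothetical four-position signatures and query. The real signatures
# (Table 4.10) are longer, and their alignment comes from structure.
SIGNATURES = {"RPE": "HDHD", "KGPDC": "DEDK"}
query = "HDHE"

counts = {name: positive_matches(query, sig) for name, sig in SIGNATURES.items()}
best = max(counts, key=counts.get)
print(counts, best)
```

Counting positive scores rather than exact matches means that a conservative substitution, such as glutamate for aspartate, still counts toward a subclass.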

Figure 4.14 An example of a structural genomics protein page found in the PDB. In panel A, the structure is annotated as being of “unknown function”; it is one of 2749 such entries. In panel B, the protein structure has no published journal citation, but the structural genomics center is listed. The PDB page contains very limited information.

Table 4.7 The list of structural genomics proteins gathered from the PDB that have a sequence, keyword match or structural similarity to proteins found in the RPBB superfamily.

PDB ID  Species                                                      From PDB                       Resolution  Authors                                                    Year
2aqw    Plasmodium yoelii                                            Putative OMPDC                 2.00 Å      Structural Genomics Consortium                             2007
3ldv    Vibrio cholerae                                              Putative OMPDC                 1.77 Å      Center for Structural Genomics of Infectious Diseases      2010
1vqt    Thermotoga maritima                                          Putative OMPDC                 2.00 Å      Joint Center for Structural Genomics                       2005
2ffc    Plasmodium vivax                                             Putative OMPDC                 1.70 Å      Structural Genomics Consortium                             2007
2cze    Pyrococcus horikoshii OT3                                    Putative OMPDC                 2.00 Å      RIKEN Structural Genomics/Proteomics Initiative            2006
3r89    Anaerococcus prevotii DSM 20548                              Putative OMPDC                 1.84 Å      Midwest Center for Structural Genomics                     2011
3v75    Streptomyces avermitilis                                     Putative OMPDC                 1.40 Å      Midwest Center for Structural Genomics                     2012
1vc4    Thermus thermophilus                                         Putative IGPS                  1.80 Å      RIKEN Structural Genomics/Proteomics Initiative            2004
3qja    Mycobacterium tuberculosis                                   Putative IGPS                  1.29 Å      TB Structural Genomics Consortium                          2011
4fb7    Mycobacterium tuberculosis H37Rv                             Putative IGPS                  1.30 Å      Midwest Center for Structural Genomics                     2012
3tsm    Brucella melitensis biovar Abortus 2308                      Putative IGPS                  2.15 Å      Seattle Structural Genomics Center for Infectious Disease  2011
1ujp    Thermus thermophilus HB8                                     Putative TrpA                  1.34 Å      RIKEN Structural Genomics/Proteomics Initiative            2003
2e09    Pyrococcus furiosus                                          Putative TrpA                  2.40 Å      RIKEN Structural Genomics/Proteomics Initiative            2007
1wq5    Escherichia coli                                             Putative TrpA                  2.30 Å      RIKEN Structural Genomics/Proteomics Initiative            2005
3tha    Campylobacter jejuni subsp. jejuni NCTC 11168 = ATCC 700819  Putative TrpA                  2.37 Å      Center for Structural Genomics of Infectious Diseases      2011
3nav    Vibrio cholerae O1 biovar El Tor str. N16961                 Putative TrpA                  2.10 Å      Center for Structural Genomics of Infectious Diseases      2010
3inp    Francisella tularensis subsp. tularensis                     Putative RPE                   2.05 Å      Center for Structural Genomics of Infectious Diseases      2009
3cu2    Haemophilus somnus                                           Putative RPE                   1.91 Å      Joint Center for Structural Genomics                       2008
1tqx    Plasmodium falciparum 3D7                                    Putative RPE                   2.00 Å      Structural Genomics of Pathogenic Protozoa Consortium      2006
3qc3    Homo sapiens                                                 Putative RPE                   2.20 Å      Joint Center for Structural Genomics                       2011
3f4w    Salmonella typhimurium                                       Putative HPS                   1.65 Å      -                                                          2008
2a0n    Thermotoga maritima                                          Putative HisF/HisA             1.64 Å      Joint Center for Structural Genomics                       2005
1vh7    Thermotoga maritima                                          Putative HisF/HisA             1.90 Å      Structural GenomiX                                         2005
4gj1    Campylobacter jejuni subsp. jejuni NCTC 11168 = ATCC 700819  Putative HisF/HisA             2.15 Å      Center for Structural Genomics of Infectious Diseases      2012
2agk    Saccharomyces cerevisiae                                     His6 protein                   1.30 Å      Paris-Sud Yeast Structural Genomics                        2006
1ka9    Thermus thermophilus                                         Putative HisF/HisA             2.30 Å      RIKEN Structural Genomics/Proteomics Initiative            2002
1y0e    Staphylococcus aureus (strain N315)                          Putative ManNAc-6-P epimerase  1.95 Å      Midwest Center for Structural Genomics                     2004
1yxy    Streptococcus pyogenes                                       Putative ManNAc-6-P epimerase  1.60 Å      Midwest Center for Structural Genomics                     2005

4.2 Analysis of RPBB superfamily using SALSA

Here a method for protein function prediction is presented that utilizes functional active site predictors, which in turn use the 3D structure to determine the catalytic and active site residues of a protein. Along with other online tools, a classification set is created from known protein structures whose biochemical functions have been experimentally validated. A scoring system helps to determine the best proteins to fit within a set. The SG proteins are then compared to this known set and assigned a function. This is all based on structural comparison using multiple structure alignment servers. Eight functional subclasses in the RPBB superfamily have been identified. With the functional active site predictors, we have sorted the active site residues and found differences between each of the subclasses. With this information, many structural genomics proteins have been correctly annotated. The SALSA methodology has been demonstrated in the RPBB superfamily. Large genomic-scale annotations have been performed on a few subclasses within the superfamily, but this is the first structure-based functional analysis of the superfamily.

4.2.1 Criteria for the dataset

To identify structures within a superfamily, functional information was first gathered from the SCOP database to define subclasses. Each subclass was run through a multiple structure alignment server. TCOFFEE 3D gave an output file that was used in CHIMERA for sequence and visualization analysis. The different sets of POOL input features that were tried for the superfamily known set and the structural genomics proteins are listed in Table 4.8. The top 9% of residues in the POOL rank order was used as a cut-off to limit the number of false positives.
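The 9% cut-off amounts to truncating a ranked residue list. The sketch below assumes a list already sorted from most to least likely active-site residue; the residue labels are placeholders, and rounding the cut-off up to a whole residue is an assumption, since the rounding rule is not stated here.

```python
import math


def top_fraction(ranked_residues, fraction=0.09):
    """Keep the top fraction of a ranked residue list (best first).

    Rounding up is an assumption made for this sketch.
    """
    n_keep = math.ceil(len(ranked_residues) * fraction)
    return ranked_residues[:n_keep]


# Hypothetical ranking for a 220-residue protein; labels are placeholders.
ranking = [f"res{i}" for i in range(1, 221)]
selected = top_fraction(ranking)
print(len(selected))
```

For a protein of about 220 residues, this keeps roughly 20 candidates for comparison against the consensus signatures.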

Table 4.8 Reported input features used for POOL calculations of each protein. The input features are as follows: .ibm4ranks = THEMATICS + INTREPID; .bm4ranks = THEMATICS; .tcranks = THEMATICS + ConCavity.

Subclass    PDB      Features
IGPS        1pii     .ibm4ranks
IGPS        1i4n     .TICranks
IGPS        2c3z     .ibm4ranks
TrpA        1geq     .ibm4ranks
TrpA        1qop     .ibm4ranks
TrpA        1xc4     .tcranks
TrpA        1rd5     .ibm4ranks
TrpA        1kfk     .tcranks
PRAI        1lbm     .ibm4ranks
PRAI        1pii     .ibm4ranks
HisA/HisF   1qo2     .ibm4ranks
HisA/HisF   1vzw     .ibm4ranks
HisA/HisF   2y85     .tcranks
HisA/HisF   1thf     .ibm4ranks
HisA/HisF   1h5y     .ibm4ranks
HisA/HisF   1ox6     .tcranks
RPE         1rpx     .ibm4ranks
RPE         2fli     .ibm4ranks
RPE         1h1y     .ibm4ranks
RPE         1tqj     .ibm4ranks
RPE         3ovp     .tcranks
OMPDC       1dbt     .ibm4ranks
OMPDC       1dv7     .ibm4ranks
OMPDC       1dqw     .ibm4ranks
OMPDC       1l2u     .ibm4ranks
OMPDC       2za1     .ibm4ranks
OMPDC       3qw3     .tcranks
OMPDC       3l03     .tcranks
KGPDC       1xbv     .ibm4ranks
KGPDC       3exr     .ibm4ranks
HPS         3ajx     .ibm4ranks
HPS         P42405   .tcranks

SG - PDB    Features
1ka9        .ibm4ranks
1tqx        .ibm4ranks
1ujp        .ibm4ranks
1vc4        .ibm4ranks
1vh7        .bm4ranks
1vqt        .ibm4ranks
1wq5        .ibm4ranks
1y0e        .ibm4ranks
1yxy        .ibm4ranks
2a0n        .tcranks
2agk        .bm4ranks
2aqw        .ibm4ranks
2cze        .tcranks
2e09        .bm4ranks
2ffc        .ibm4ranks
3cu2        .ibm4ranks
3f4w        .ibm4ranks
3inp        .ibm4ranks
3ldv        .ibm4ranks
3nav        .ibm4ranks
3qc3        .tcranks
3qja        .tcranks
3r89        .tcranks
3tc6        .tcranks
3tc7        .tcranks
3tha        .tcranks
3tsm        .tcranks
3v75        .tcranks
4fb7        .tcranks
4gj1        .bm4ranks

For this superfamily, the best collection of correctly predicted active site residues was obtained using the .ibm4ranks method, which incorporates INTREPID scores and THEMATICS. In 2011, the INTREPID server went offline, so its scoring output files can no longer be incorporated into the POOL method. The experimentally validated data set of known proteins used the .ibm4ranks scoring output files to identify consensus signatures. The next best method was .tcranks, which includes THEMATICS and ConCavity scoring. The active site residues predicted using ConCavity plus THEMATICS were better than those predicted using THEMATICS alone. For this superfamily, the .TICranks input features did not perform better than the .tcranks or .ibm4ranks input features.

4.2.2 Building homology models for structures in the RPBB superfamily

For the crystal structures of 16 proteins, regions were missing (Table 4.9). Homology modeling with YASARA was used to build the missing regions. The number of missing residues ranged from three to over 20. Both the homology models and the crystal structures were submitted to POOL. If the missing residues were located near the active site, prediction of the catalytic active site residues was significantly impaired. For example, a homology model was constructed for PDB 1dv7, and the model had more predicted active site residues. This was essential for identifying all members of the consensus signatures. Proteins with residues missing away from the active site were less affected, as seen for PDB: 1vzw, 1lbm and 2y85. One crystal structure was available for the HPS subclass in 2010. A homology model was constructed for this subclass using the protein sequence of the experimentally verified 3-hexulose-6-phosphate synthase (UniProt ID: P42405) from Bacillus subtilis (strain 168).40

Table 4.9 The PDB listing of structures that required building of homology models. Some proteins were missing between three and twenty residues. YASARA was used to build the homology models. Homology models made are in BOLD.

PDB   Residues missing       Location                   Impact
1geq  G165-E173              after β-strand 6           None
1kfk  S178-L193              after β-strand 6           None
1ujp  T177-E188              after β-strand 6           None
1vzw  L20-T27, A169-Q175     after β-strand 1 & 6       None
1wq5  P57-I64, T183-E186     after β-strand 2 & 6       None
1xc4  S55-G75, V182-R188     after β-strand 2 & 6       None
2e09  T168-E173              after β-strand 6           None
3nav  A180-A189              after β-strand 6           None
3qja  K59-D73                after β-strand 1           None
1lbm  T127-F139              after β-strand 5           None
2y85  L20-E30                after β-strand 1           None
1ox6  V256-L276              after β-strand 7           None
1tqj  N144-Q150              after β-strand 6           None
1dv7  G181-E191              after β-strand 7           Missing active site residues
1l2u  Q194-I204              after β-strand 7           Missing active site residues
3qw3  I201-S207              after β-strand 7           None
4gj1  V142-S149, D176-Q180   after β-strand 6 & 7       Missing active site residues
2agk  I18-S37, A180-C186     after β-strand 1 & 7       Missing active site residues

4.2.3 Identifying consensus signatures in the RPBB superfamily

Using information gathered from the sequence and structural data available, the POOL method was used to predict the important residues for each subclass (Table 4.10). In the following sections, the important residues selected for consensus signatures of each subclass will be described in detail. It is important to identify the diversity of residues between groups that fall along the β-strands which give rise to various functions. The incorporation of information from literature and the webservers Mechanism, Annotation and Classification in Enzymes (MACiE)41 and Ligand-Protein Contacts & Contacts of Structural Units (LPC)42 server helped to define the consensus signatures.

Table 4.10 Spatially aligned consensus signatures for each subclass. The PDB code for each protein is given in the left column with its corresponding subclass. Each row represents a protein structure and each column represents an aligned spatial position. POOL-predicted residues are shown in uppercase letters. Using POOL predictions and experimental literature, the consensus signatures (in blue) were determined for each subclass and consist of specific amino acid types found in specific spatial positions.

Position of aligned residues (1-24) across the β1, β2, β3, β4, β5, β6, β7 and β8 strands

Subclass    PDB ID   Residues, in order of aligned position
PRAI        1pii     C260 G280 R289 H334 D379 A405 G406 S428 a429
PRAI        1lbm     C7 G27 R36 H83 D126 S157 G158 s181 g182
IGPS        1pii     E53 K55 K114 D115 F116 E163 N184 R186 E214 S215 g236 s237
IGPS        1i4n     E47 K49 K108 D109 F110 E157 N179 R181 E209 S210 G231 T232
IGPS        2c3z     E51 K53 K110 D111 F112 E159 N180 R182 E210 S211 g233 s234
TrpA        1geq     Y10 E36 P44 D47 q53 Y88 D116 Y161 G197 F198 G199 G220
TrpA        1qop     F22 E49 P57 D60 Q65 Y102 D130 Y175 g211 F212 G213 g234
TrpA        1xc4     F22 E49 p57 D60 q65 y102 D130 Y175 g211 F212 g213 S233
TrpA        1rd5     Y23 E50 P58 D61 Q66 Y102 D126 Y171 G207 F208 G209 G230
HisA/HisF   1qo2     D8 H48 r98 D127 T164 D169 A194 g195 v222 g223
HisA/HisF   1vzw     D11 H50 R100 D130 T166 D171 S196 g197 g222 k223
HisA/HisF   2y85     D11 H50 R100 D130 T170 D175 S200 g201 g226 k227
HisA/HisF   1thf     D11 v48 K99 D130 t171 D176 S201 g202 a224 s225
HisA/HisF   1h5y     D12 a51 K102 D133 t174 D179 S204 g205 A227 s228
HisA/HisF   1ox6     D245 t295 K360 D404 N469 D474 S499 S500 A523 g524
RPE         1rpx     S16 H41 D43 M45 D72 H74 M76 H98 E100 M147 D185 g186 g207 s208
RPE         2fli     S9 H34 D36 M38 D65 H67 M69 H91 E93 M138 D176 g177 g198 s199
RPE         1h1y     S11 H36 D38 M40 D67 H69 M71 H93 E95 m144 D178 g179 g200 s201
RPE         1tqj     S10 H35 D37 M39 D66 H68 M70 H92 E94 M141 D179 g180 g201 s202
RPE         3ovp     S10 H35 D37 m39 D68 H70 m72 H94 E96 m141 D175 g176 G197 s198
OMPDC       1dbt     D11 K33 D60 K62 D65 H88 P182 g183 g214 R215
OMPDC       1dv7     D20 K42 D70 K72 D75 H98 P180 g181 g202 R203
OMPDC       1dqw     D37 K59 D91 K93 D96 H122 P202 G203 G234 R235
OMPDC       1l2u     D22 K44 D71 K73 D76 H99 P189 G190 g221 R222
OMPDC       2za1     D23 K102 D136 K138 D141 n165 P264 G265 g293 R294
OMPDC       3qw3     D21 K49 D82 K84 d87 s111 P199 G200 s228 R229
OMPDC       3l0k     D35 K57 D88 K90 d93 H119 P193 g194 G226 R227
KGPDC       1xbv     D11 E33 D62 K64 D67 E112 H136 R139 T169 G170 G191 R192
KGPDC       3exr     D13 E35 D64 K66 d69 E117 H141 R144 T174 G175 G196 R197
HPS         3ajx     D1008 E1030 D1059 K1061 D1064 D1109 H1134 A1164 g1165 G186 g1187
HPS         P42405   D8 E30 D59 K61 d64 D109 H134 a165 g166 G187 G188
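The uppercase/lowercase convention of Table 4.10 lends itself to a simple consistency check. The sketch below parses the RPE rows and reports the aligned positions where every member has the same residue type and POOL predicts it in every structure. This strict criterion is a hypothetical illustration only; the consensus signatures in this work were defined from POOL predictions together with the experimental literature, not by this rule.

```python
# RPE rows copied from Table 4.10; an uppercase letter marks a residue
# predicted by POOL, a lowercase letter one that was not predicted.
RPE_ROWS = {
    "1rpx": "S16 H41 D43 M45 D72 H74 M76 H98 E100 M147 D185 g186 g207 s208",
    "2fli": "S9 H34 D36 M38 D65 H67 M69 H91 E93 M138 D176 g177 g198 s199",
    "1h1y": "S11 H36 D38 M40 D67 H69 M71 H93 E95 m144 D178 g179 g200 s201",
    "1tqj": "S10 H35 D37 M39 D66 H68 M70 H92 E94 M141 D179 g180 g201 s202",
    "3ovp": "S10 H35 D37 m39 D68 H70 m72 H94 E96 m141 D175 g176 G197 s198",
}


def parse_entry(token):
    """Split 'D43' or 'g186' into (residue type, number, predicted)."""
    return token[0].upper(), int(token[1:]), token[0].isupper()


def strictly_conserved(rows):
    """Residue types identical and POOL-predicted in every row."""
    parsed = [[parse_entry(t) for t in row.split()] for row in rows.values()]
    conserved = []
    for column in zip(*parsed):
        types = {res for res, _, _ in column}
        if len(types) == 1 and all(pred for _, _, pred in column):
            conserved.append(column[0][0])
    return conserved


print(strictly_conserved(RPE_ROWS))
```

Under this strict rule the methionine and glycine positions drop out because at least one structure lists them in lowercase, which shows why literature evidence was needed alongside the predictions.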

4.2.3.1 Phosphoribosyl anthranilate isomerase (PRAI)

There are five catalytic residues predicted by POOL (Table 4.10) for the phosphoribosyl anthranilate isomerase (PRAI) subclass. Here positions refer to the spatial positions shown in Table 4.10. The sulfhydryl group of the cysteine (position 1) acts as a general base (Figure 4.15).15 The arginine (position 8) forms a salt bridge with the carboxylate group of the anthranilate moiety of CdRP. The histidine (position 14) acts as a coordinating ligand by interacting with the ribose ring of CdRP, or as a proton shuttle in the active site. The oxygen atoms of the carboxylate group of aspartate (position 19) form a hydrogen bond to the C4-hydroxyl oxygen of CdRP. The glycine (position 3) possibly provides a neutral environment that increases the relative catalytic power of the charged moieties of Cys and Arg in the surrounding area.

Figure 4.15 The active site of PRAI with bound ligand CdRP in light purple. The structurally aligned crystal structures from E. coli (PDB: 1pii) in purple and T. maritima (PDB: 1lbm) in light green are shown. A small region from 1lbm was removed to gain a better view.

4.2.3.2 Indole glycerol phosphate synthase (IGPS)

There are ten catalytic residues predicted by POOL (Table 4.10). The glutamate (position 1) hydrogen bonds to the C3’ hydroxyl group and forms a salt bridge to lysine (position 10).18 The ε-ammonium group of the lysine (position 2) forms a salt bridge with the carboxylate group of CdRP and also hydrogen bonds to the C3’ hydroxyl group of CdRP.18 The lysine (position 10) is the general acid and the glutamate (position 16) is the general base; both are essential for catalytic activity. The phenylalanine (position 12) is crucial for its hydrophobic interaction with the benzene moiety of CdRP and IGP. The glutamate (position 21) and the asparagine (position 19) form a hydrogen-bonded pair to regenerate the base glutamate (position 16). The serine (position 22) assists in orienting the molecule during catalysis (Figure 4.16). The ligand IGP is taken from PDB 1a53, a lower-resolution model of 2c3z, and is shown to demonstrate the interactions.

Figure 4.16 The active site of IGPS with bound ligand IGP in green. The structurally aligned crystal structures from Escherichia coli (PDB: 1pii:B) in gray, Thermotoga maritima (PDB: 1i4n) in tan and Sulfolobus solfataricus (PDB: 2c3z) in yellow are shown with residue numbers from PDB 1pii:B.

4.2.3.3 Tryptophan synthase (TrpA)

There are nine catalytic residues predicted by POOL (Table 4.10). The CSA (Table 4.3) reports two active site residues, a glutamate (position 3) and an aspartate (position 6). The glutamate protonates the hydroxyl group of the glyceraldehyde moiety, acting both as a proton donor and acceptor. The aspartate acts as a general base to deprotonate the nitrogen of the indole ring. Figure 4.17 shows the residues tyrosine (position 12), proline (position 5) and glutamine (position 7) interacting with the catalytic aspartate. Tyrosine (position 18) forms hydrogen bonds with the hydroxyl groups of the C2 and C3 atoms of IGP. Tyrosine (position 1) and phenylalanine (position 22) create hydrophobic interactions in the binding site.

Figure 4.17 The active site of TrpA with bound ligand IGP in gray. The structurally aligned crystal structures of Pyrococcus furiosus (PDB: 1geq) in brown, Salmonella enterica subsp. enterica serovar Typhimurium (PDB: 1qop) in blue, Escherichia coli (PDB: 1xc4) in purple and Zea mays (PDB: 1rd5) in green are shown. The residue numbers come from PDB 1geq.

4.2.3.4 Phosphoribosylformimino-5-aminoimidazole carboxamide ribotide isomerase (HisA) and imidazoleglycerolphosphate synthase (HisF)

There are eight catalytic residues predicted by POOL (Table 4.10). The CSA (Table 4.4) reports two catalytic aspartates, in positions 2 and 17. Because two types of reactions occur in this subclass, the residues act both as a general acid and as a general base.

Figure 4.18 The active site of HisA/HisF with bound PRFAR in white. The structurally aligned HisA crystal structures from Thermotoga maritima (PDB: 1qo2), Streptomyces coelicolor (PDB: 1vzw) and Mycobacterium tuberculosis H37Rv (PDB: 2y85) and HisF structures from Thermotoga maritima (PDB: 1thf), Pyrobaculum aerophilum (PDB: 1h5y) and Saccharomyces cerevisiae (PDB: 1ox6) are shown. The black residue labels correspond to HisA PDB 1vzw and the blue residue labels to HisF PDB 1thf.

In HisA, a histidine (position 3) interacts with the aspartate (position 2) and the substrate ProFAR. In site-directed mutagenesis studies, the threonine (position 19) was shown to be important for catalytic activity.15 The threonine may act as a nucleophile alongside the histidine and aspartate, which are within 6 Å.

In HisF, a lysine (position 13), which is replaced by an arginine in HisA, recruits PRFAR to the active site. PRFAR contains two phosphate groups that stretch along the active site. On the third β-strand, Gly81 lies in the glycine-rich region that forms phosphate binding region 1. On the opposite side, the serine (position 21) lies along the seventh β-strand next to phosphate binding region 2. The two phosphate groups and the aspartate (position 20) are involved in coordination of the substrate PRFAR. The bound PRFAR is taken from PDB structure 1ox5, which is 99.8% identical to 1ox6.

4.2.3.5 Ribulose phosphate epimerase (RPE)

There are eleven catalytic residues predicted by POOL (Table 4.10). The CSA reports four catalytic residues, in positions 3, 4, 10 and 21. The two histidines (positions 3 and 10) coordinate the Zn2+ ion and position it for catalysis. The aspartate (position 4) abstracts a proton from C3 of the ribulose phosphate. Protonation of the 2,3-enediolate intermediate is facilitated by the aspartate (position 21). Other residues predicted by POOL are in positions 1, 5, 9, 12, 14, 15 and 19. The substrate is bound to the protein via hydrogen bonding interactions between O4 of X5P and the hydroxyl group of serine (position 1).32 The aspartate (position 9) and histidine (position 14) each form a hydrogen bond to histidine (position 3), moving that histidine closer to coordinate with the Zn2+ ion.32

Figure 4.19 The active site of RPE with bound ligand D-xylitol 5-phosphate in white. The structurally aligned crystal structures from Solanum tuberosum (PDB: 1rpx) in dark blue, Streptococcus pyogenes (PDB: 2fli) in cyan, Oryza sativa (PDB: 1h1y) in green, Synechocystis sp. (PDB: 1tqj) in purple and Homo sapiens (PDB: 3ovp) in brown are shown. The residue labels correlate to PDB 1rpx. PDB 2fli was solved with the bound ligand, X5P.32

There are three strictly conserved methionine residues, in positions 5, 12 and 19, and their sulfur atoms line the active center pocket, preventing protonation of the O2 atom of R5P.43 Methionine is the amino acid residue most frequently exchanged during evolution, so the strict conservation of these three residues indicates that they are important for catalysis in this subclass.44 The glutamate (position 15) interacts with histidine (position 14) and is conserved in this group.

4.2.3.6 Orotidine monophosphate decarboxylase (OMPDC)

There are eight catalytic residues predicted by POOL (Table 4.10). The CSA reports three catalytic residues, in positions 10, 12 and 13. The aspartate (position 10) causes electrostatic and steric destabilization of the orotate ring to initiate decarboxylation (Figure 4.20). The lysine (position 12) protonates the carbanion of C5. The aspartate (position 13) is from the second polypeptide and interacts with the 2’-hydroxyl group of the ribose moiety.

Figure 4.20 The active site of OMPDC with bound ligand OMP in cyan (from PDB 2za1). The structurally aligned crystal structures from Bacillus subtilis (PDB: 1dbt) in orange, Saccharomyces cerevisiae (PDB: 1dqw) in light blue, Escherichia coli (PDB: 1l2u) in purple and Plasmodium falciparum (PDB: 2za1) in light green are shown. The residue labels correlate to PDB 1dbt.

Other residues predicted by POOL are in positions 2, 3, 14, 21 and 24. The aspartate in position 2 hydrogen bonds to the 3’-hydroxyl group of the ribose moiety. The lysine (position 3) interacts with the 3’-hydroxyl group of the ribose moiety and possibly helps to transfer protons to neighboring aspartates. The histidine (position 14) is a conserved residue in all OMP decarboxylase sequences but makes no direct contacts with the substrate. The proline (position 21) points towards the C5 carboxylate group, which creates the hydrophobic cavity.45 Other residues that may be involved in hydrophobic interactions were not predicted by POOL. In position 24, the arginine’s guanidinium group interacts with the phosphate group of OMP.46

4.2.3.7 Keto-gulonate-6-phosphate decarboxylase (KGPDC)

There are ten residues predicted by POOL (Table 4.10). The CSA reports five catalytic residues, in positions 2, 3, 10, 12, and 13. The aspartate (position 2) interacts with glutamate (position 3). The glutamate (position 3) and aspartate (position 10) are ligated to the Mg2+ ion. The magnesium ion coordinates to the oxygen atom of the keto group at the C3 position and the oxygen atom of the hydroxyl group at the C4 position of the substrate.33 Lysine (position 12) and aspartate (position 13) interact with the Mg2+ ion, which stabilizes and positions the intermediate for proper addition of the proton.34 Other residues predicted by POOL are in positions 16, 18, 20, 21 and 24. The arginine (position 20) and histidine (position 18) function as proton sources to competitively deliver protons to the C1 position of the enediolate intermediate by participating in a proton relay system in the active site with two water molecules.35 Protonation can lead to either R or S configurations depending on which of the two conserved residues is used as the ultimate source of the proton.47 The glutamate (position 16) interacts with histidine (position 18) and moves it closer to the intermediate. The histidine initiates substrate protonation by transferring a proton via a water molecule to the C1 of the intermediate. The threonine (position 21) forms a hydrogen bond with an oxygen atom of xylulose-5-phosphate. The arginine’s guanidinium group (in position 24) interacts with the phosphate group of the intermediate. There are two glycine residues conserved next to the arginine, which form the phosphate binding region for this subclass.

Figure 4.21 The active site of KGPDC with bound ligand ribulose 5-phosphate. The structurally aligned crystal structures from Escherichia coli (PDB: 1xbv) in cyan and Streptococcus mutans (PDB: 3exr) in tan. PDB 1xbv was solved with the bound ligand, R5P. The residue labels correlate to PDB 1xbv. The R5P is not a substrate or ligand but gives an idea of the coordination of the Mg2+ ion (lime green) to oxygen atoms.

4.2.3.8 Hexulose-6-phosphate synthase (HPS)

There are six catalytic residues predicted by POOL (Table 4.10). The CSA does not report any catalytic residues. HPS is similar to KGPDC in positions 2, 3, 10, 12 and 13 because they have common residues for coordination of the Mg2+ ion and for formation of the enediolate intermediate. The aspartate (position 2) interacts with glutamate (position 3). The glutamate (position 3) and aspartate (position 10) coordinate to the Mg2+ ion. The magnesium ion then coordinates to the oxygen atom of the keto group at the C2 position and the oxygen atom of the hydroxyl group at the C4 position of the substrate.37 Lysine (position 12) and aspartate (position 13) interact with the Mg2+ ion, which stabilizes and positions the intermediate for proper addition of the proton.37 The aspartate (position 16) is shown to be a catalytic shuttle for a proton. The histidine (position 18) acts as the catalytic base to abstract a proton from the intermediate to produce D-arabino-3-hexulose-6-phosphate (A3H6P). The lack of an arginine residue (position 20) in this active site provides space for access of formaldehyde to the enediolate intermediate.37

Figure 4.22 The active site of HPS. This is a structural alignment of the crystal structure from Mycobacterium gastri MB19 (PDB: 3ajx) in rose and the homology model P57421 in light yellow. The residue labels correlate to PDB 3ajx.

POOL predictions provide a clearer picture of the differences between the active sites. Comparing the CSA to the consensus signatures (Table 4.11) shows that the POOL predictions supply more information with which to sort the superfamily. With each subclass above sorted for this superfamily, evolutionary changes can be observed and structural genomics proteins can be assigned.
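As a rough tally of this point, the following Python sketch collects the POOL and CSA residue counts quoted in Sections 4.2.3.3 to 4.2.3.8 and reports how many additional residues POOL contributes for each subclass. The dictionary and function names are illustrative and are not part of POOL, the CSA or SALSA. The subtraction assumes that the CSA residues are a subset of the POOL predictions, which the text implies for RPE, OMPDC and KGPDC but does not state for every subclass.

```python
# Residue counts quoted in Sections 4.2.3.3 to 4.2.3.8 of this chapter.
# Each value is (POOL-predicted residues, CSA-reported residues).
counts = {
    "TrpA": (9, 2),
    "HisA/HisF": (8, 2),
    "RPE": (11, 4),
    "OMPDC": (8, 3),
    "KGPDC": (10, 5),
    "HPS": (6, 0),
}


def extra_residues(pool: int, csa: int) -> int:
    """Residues contributed by POOL beyond the CSA entry.

    Assumption: every CSA residue is also predicted by POOL.
    """
    return pool - csa


for subclass, (pool, csa) in counts.items():
    extra = extra_residues(pool, csa)
    print(f"{subclass:10s} POOL={pool:2d} CSA={csa} additional={extra}")

total_extra = sum(extra_residues(p, c) for p, c in counts.values())
print("additional residues over the six subclasses:", total_extra)
```

HPS is the extreme case in this tally: the CSA reports no catalytic residues, so every residue in its consensus signature comes from the POOL prediction.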

4.2.4 Using consensus signatures to analyze the RPBB superfamily

The consensus signatures for the group (Table 4.10) are of interest because the active site architecture is conserved between subclasses. Though the chemistry is different, the spatial arrangement reflects the evolutionary changes that have occurred in this superfamily. To date, analyses have covered only a few subclasses rather than the whole superfamily.48 The following sections describe the spatial arrangements for the subclasses in detail.

4.2.4.1 PRAI vs. HisA: Different substrates, similar reaction mechanism

The first example compares phosphoribosyl anthranilate isomerase (PRAI) and phosphoribosylformimino-5-aminoimidazole carboxamide ribotide isomerase (HisA). The two substrates differ in molecular weight, with ProFAR from HisA being twice the size of PRA from PRAI. The two enzymes catalyze similar Amadori rearrangements of an aldose into a ketose. This general acid-base mechanism utilizes an acid (AH) which protonates the furanose ring of the substrate, PRA for PRAI and ProFAR for HisA (Figure 4.23). On β-strand 1, a cysteine (position 1) in PRAI is the general base equivalent to the aspartate (position 2) in HisA. On β-strand 2 of HisA, the histidine (position 3) is important for its interaction with the basic aspartate.

Figure 4.23 The substrates and products from the enzymes PRAI and HisA

Figure 4.24 Overlay of active site residues for PRAI and HisA. PRAI (pink) is bound with its product CdRP (rose) and is shown with its catalytic residues Cys 7 and Asp 126. HisA (green) is shown with its product PRFAR (dark cyan) and catalytic residues Asp 8 and Asp 127.

On β-strand 5 of HisA, the aspartate (position 17) is a general acid and is equivalent to the aspartate (position 19) on β-strand 6 of PRAI. Even though the general acid is located on a different strand, it still performs the same function.15 Comparing the active sites of PRAI and HisA shows the positioning of the two catalytic aspartates (Figure 4.24). Though there is no bound structure reported for ProFAR in HisA, there is a bound structure of PRFAR in HisF. Using the PRFAR, the HisF structure was aligned to HisA. This alignment shows the differences and similarities in the active sites of the two enzymes. The larger size of ProFAR/PRFAR explains the location of the catalytic residues in the active site. This similarity of the general acid for the Amadori rearrangement was first observed in sequence alignments by Henn-Sax.15 Asp 126 interacts with O4 of CdRP in PRAI, while Asp 127 forms hydrogen bonds with the ribose hydroxyl groups of PRFAR in HisA. A threonine that aligns with Asp 126 (PRAI) is shown to be catalytically important for HisA in mutagenesis studies.15 The threonine cannot provide a proton for catalysis, but Asp 127 can.

In the consensus table, the two catalytic aspartates are found in different regions of the β-strands. POOL is able to predict these residues and others that are essential in the two subclasses.

4.2.4.2 OMPDC vs. KGPDC vs. HPS vs. RPE: Conserved spatial architecture of the active site

OMPDC, KGPDC and HPS share a conserved aspartate active site residue at the end of β-strand 1 and the D-x-x-K-D motif located at the end of β-strand 3. RPE does not contain the D-x-x-K-D motif. KGPDC and HPS share several conserved residues and a similar intermediate with the Mg2+ ion. KGPDC can catalyze the HPS reaction at low efficiency because of the spatially similar active site.49 The reaction mechanisms are different for each subclass (Figure 4.25).
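The presence or absence of the motif can be checked directly from the consensus positions. The Python sketch below copies the residues at positions 10, 12 and 13 from Table 4.11 and tests for aspartate, lysine and aspartate, respectively. The names `signature`, `MOTIF` and `has_motif` are illustrative, and treating upper-case and lower-case residue letters as equivalent is an assumption made here for simplicity.

```python
# Residues at consensus positions 10, 12 and 13, copied from Table 4.11
# (OMPDC 1dbt, KGPDC 3exs, HPS 3ajx, RPE 2fli).
signature = {
    "OMPDC": {10: "D60", 12: "K62", 13: "D65"},
    "KGPDC": {10: "D64", 12: "K66", 13: "D69"},
    "HPS": {10: "D1059", 12: "K1061", 13: "D1064"},
    "RPE": {10: "H67", 12: "M69", 13: "e74"},
}

# Expected amino acid type at each consensus position of the motif.
MOTIF = {10: "D", 12: "K", 13: "D"}


def has_motif(residues: dict) -> bool:
    """True if positions 10, 12 and 13 carry Asp, Lys and Asp.

    Assumption: letter case is ignored when comparing residue types.
    """
    return all(residues[pos][0].upper() == aa for pos, aa in MOTIF.items())


motif_by_subclass = {name: has_motif(res) for name, res in signature.items()}
for name, present in motif_by_subclass.items():
    print(f"{name:6s} motif present: {present}")
```

With the Table 4.11 residues, the check is positive for OMPDC, KGPDC and HPS and negative for RPE, which carries histidine, methionine and glutamate at these positions.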

OMPDC has been hypothesized as the progenitor for KGPDC and HPS.34 OMPDC catalyzes the decarboxylation of orotidine monophosphate (OMP) to uridine monophosphate (UMP) efficiently (Kcat ~ 10^17) versus an uncatalyzed reaction.50 The reaction is carried out without a metal ion. KGPDC catalyzes the metal ion-dependent decarboxylation of 3-keto-L-gulonate 6-phosphate (3KG6P). HPS catalyzes the reversible Mg2+-dependent aldol condensation of D-ribulose 5-phosphate (R5P) and formaldehyde to form D-arabino-3-hexulose 6-phosphate (A3H6P).37 RPE interconverts ribulose-5-phosphate (R5P) and D-xylulose-5-phosphate (X5P).25 The sequence identity between these four subclasses is around 25% (Table 4.7). OMPDC is found in various species because of its importance in pyrimidine biosynthesis. KGPDC is found in microorganisms that utilize L-ascorbate as a carbon source, and HPS is found in methylotrophic bacteria, microorganisms that utilize formaldehyde.33,51

To understand changes within the active sites of OMPDC, KGPDC and HPS, the substrates are of interest. OMP is larger than R5P and 3KG6P; this affects the distances between important residues and the size of the active site. The furanose ring of OMP will have different interactions from those of the open-chain conformations of R5P and 3KG6P. HPS incorporates a formaldehyde molecule in the reaction process. The reaction mechanisms of OMPDC and KGPDC are similar, both carrying out a decarboxylation, while HPS catalyzes an aldol condensation. With these differences and similarities in mind, the active sites can be analyzed by aligning the local active sites of bound structures to study the interactions. Together with the literature, the POOL predictions from the consensus signatures provide more information to explain the subtle differences.

Figure 4.25 The reaction mechanisms for OMPDC, KGPDC, HPS and RPE. A) The OMPDC reaction produces uridine monophosphate. B) The KGPDC reaction produces L-xylulose-5-phosphate. C) The HPS reaction produces an arabino-hexulose-6-phosphate. D) The RPE reaction produces D-xylulose-5-phosphate.

Changes to the active site between OMPDC, KGPDC and HPS are marked with red circles (Figure 4.26A). The stars highlight a conserved aspartate and the D-x-x-K-D motif (Figure 4.26A).

In Table 4.11, each position describes the activity of the residue, relative to its spatial position, for each subclass. Overall, the spatial positions along the eight β-strands are conserved, but different types of residues are involved because of changes in the specificity, mechanism, and reaction pathway of the enzyme. Because OMPDC does not need a Mg2+ ion as KGPDC and HPS do, the role of the D-x-x-K-D motif, in positions 10, 12 and 13, is different. KGPDC and HPS use the aspartate and lysine to position the 1,2-enediolate intermediate. This could be due to the open-chain form of the substrates, D-ribulose-5-phosphate and 3-keto-L-gulonate-6-phosphate. In OMPDC, the bulky pyrimidine moiety has hydrophobic interactions (proline 182, position 21), whereas KGPDC and HPS have hydrogen bonding between substrates and residues in the same area. For example, OMPDC is missing a histidine (position 18); the histidine acts as a base in the KGPDC and HPS reactions (Figure 4.26B). OMPDC contains an aspartate which destabilizes the UMP but requires a lysine as a shuttle, thus not needing a histidine (Figure 4.26C).

A closer look at OMPDC and KGPDC (Figure 4.27C and D) shows that the phosphate binding region is perfectly aligned, including the arginine (position 24). The two substrates, UMP and R5P, bound in the crystal structures PDB 3exs and 1dbt, give insight into the positions and interactions. The 3exs structure, KGPDC bound with R5P, provides interactions for the KGPDC mechanism. The substrate for KGPDC is 3KG6P, for which there is no bound structure. A similar structure (PDB 1xbv) of a mutated KGPDC is bound with 3KG6P, but gives a different conformation due to the mutation of catalytic residues. The study of the active sites of these two enzymes further demonstrates an overall conserved, similar architecture.

Figure 4.26 Aligned active sites of OMPDC, KGPDC and HPS. There are seven differences in the active site residues found with POOL predictions. All three structures share the same D-x-x-K-D motif. A) Aligned OMPDC (in green), KGPDC (purple) and HPS (gray) with a Mg2+ ion (lime green). The red circles indicate changes in the active site residues and stars indicate conserved residues. B) OMPDC, KGPDC and HPS with ribulose-5-phosphate (pink). C) OMPDC, KGPDC and HPS with uridine-5'-monophosphate (dark green).

Table 4.11 Comparison of consensus signatures between the OMPDC, KGPDC, HPS and RPE subclasses.

Subclass and PDB ID: OMPDC (1dbt), KGPDC (3exs), HPS (3ajx), RPE (2fli). Each cell gives the residue, followed by its purpose where one is listed.

β-strand | Position number | OMPDC (1dbt) | KGPDC (3exs) | HPS (3ajx) | RPE (2fli)
β-strand 1 | 1 | a9 | A11 | A1006 | S9: hydrogen bonds with O4 of X5P
β-strand 1 | 2 | D11: hydrogen bonds to 3'-OH group of ribose moiety | D13: interacts with Glu 35 | D1008: interacts with Glu 1030 | l11
β-strand 2 | 3 | K33: interacts with hydroxyl group as proton shuttle | E35: coordinates to Mg2+ ion | E1030: coordinates to Mg2+ ion | H34: coordinates to Zn2+ ion
β-strand 2 | 4 | g35 | G37 | G1032 | D36: abstracts proton from C3 of substrate
β-strand 2 | 5 | M36 | t38 | T1033 | M38: conserved, prevents protonation of O2 of R5P
β-strand 3 | 9 | F58 | v62 | F1057 | D65: hydrogen bonds to His34
β-strand 3 | 10 | D60: destabilizes UMP at C6 to initiate decarboxylation | D64: coordinates to Mg2+ ion | D1059: coordinates to Mg2+ ion | H67: coordinates to Zn2+ ion
β-strand 3 | 12 | K62: protonates the carbanion of C5 | K66: interact with Mg2+ ion and oxygen of C1 | K1061: interact with Mg2+ ion, stabilize intermediate | M69: conserved, prevents protonation of O2 of R5P
β-strand 3 | 13 | D65: interacts with 2'-OH group of ribose moiety | D69: interact with Mg2+ ion and oxygen of C1 | D1064: interact with Mg2+ ion, stabilize intermediate | e74
β-strand 4 | 14 | H88: conserved in all OMPDCs | I89 | L1084 | H91: hydrogen bonds to His34
β-strand 4 | 15 | a90 | c90 | g1085 | E93: interacts with histidine 91
β-strand 5 | 16 | v119 | E117: interacts with His 141 through hydrogen-bonded dyad | D1109: catalytic shuttle for proton transfer | v115
β-strand 6 | 18 | V160 | H141: proton source to deliver at C1 position | H1134: catalytic base to abstract proton for intermediate | l136
β-strand 6 | 19 | s162 | s143 | g1136 | M138: conserved, prevents protonation of O2 of R5P
β-strand 6 | 20 | - | R144: 2nd proton source for C, R or S position | l1137 | v140
β-strand 7 | 21 | P182: hydrophobic interactions | T174: hydrogen bonds to C4-hydroxyl group of X5P | A1164 | D176: protonates the intermediate
β-strand 7 | 22 | g183: conserved phosphate binding region | G175: conserved phosphate binding region | g1165: conserved phosphate binding region | g177: conserved phosphate binding region
β-strand 8 | 23 | g214: conserved phosphate binding region | G196: conserved phosphate binding region | G1186: conserved phosphate binding region | g198: conserved phosphate binding region
β-strand 8 | 24 | R215: interacts with PO4^2- of OMP | R197: interacts with PO4^2- of X5P | g1187 | s199

Figure 4.27 Aligned active site of OMPDC and KGPDC. A) Aligned active site of OMPDC (in green) and KGPDC (purple). B) OMPDC and KGPDC with ribulose-5-phosphate (pink). C) 2D structures of UMP and R5P. D) OMPDC and KGPDC with uridine-5'-monophosphate (dark green).

The differences between KGPDC and HPS (Figure 4.28) are observed in two residues in the active site. Keto-gulonate-6-phosphate decarboxylase (KGPDC) and hexulose-6-phosphate synthase (HPS) catalyze different reactions but share a similar Mg2+-assisted stabilization of the 1,2-enediolate intermediates.49 The sequence identity between KGPDC (UniProt: P39304, Escherichia coli) and HPS (UniProt: Q9LBW4, Mycobacterium gastri) using Clustal Omega39 is 34.95%. HPS catalyzes the KGPDC reaction with high efficiency and KGPDC catalyzes the HPS reaction with low efficiency.37,49,52 In the consensus signatures, the two enzymes share six residues with similar interactions to their corresponding 1,2-enediolate intermediates. In the active site of KGPDC, a glutamate (position 16) is found, but in HPS a conserved aspartate is located in that spatial position. The glutamate interacts with the histidine, which is a proton source, whereas the aspartate in HPS does not interact in the same way with its nearest histidine. The histidine abstracts the proton from the R5P. The other difference is the arginine; it is missing in HPS, which has a hydrophobic residue in the corresponding position. An explanation is that KGPDC needs to have interactions with the phosphate group of the xylulose-phosphate. The incorporation of formaldehyde into the active site needs to occur in HPS. The exact mechanism of how that occurs is still unknown, but this area of the active site would allow for addition.37

Focusing on the differences between KGPDC and RPE (Figure 4.29), one observes differences in the active site residues, but the architecture of the active site is intact, as for KGPDC and OMPDC. The spatial positions of the active site residues align very well between these two enzymes. Two major differences between KGPDC and RPE are the lack of the D-x-x-K-D motif in RPE and the presence of a Zn2+ ion instead of a Mg2+ ion. Along β-strand 2, both enzymes contain a residue that coordinates to a metal ion, but in RPE there are more residues that interact.
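The sequence identity quoted above for KGPDC and HPS is a percentage computed over a pairwise alignment. The Python sketch below shows one common way to compute such a value from two aligned sequences. The function name and the two short sequences are hypothetical and are not taken from P39304 or Q9LBW4, and the choice to count only columns without gaps in the denominator is an assumption; Clustal Omega's own convention may differ.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over alignment columns that contain no gap.

    Assumption: gapped columns are excluded from both the match count
    and the denominator.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if a != "-" and b != "-"]
    if not columns:
        return 0.0
    matches = sum(a == b for a, b in columns)
    return 100.0 * matches / len(columns)


# Hypothetical aligned fragments used only to exercise the function.
seq_a = "MKLD-AVEGT"
seq_b = "MRLDQAIE-T"
identity = percent_identity(seq_a, seq_b)
print(f"{identity:.2f}% identity")
```

Different denominators (alignment length, shorter sequence length, ungapped columns) give different percentages for the same alignment, so values from different tools should be compared with care.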

Figure 4.28 Aligned active site of KGPDC and HPS. A) Aligned KGPDC (purple) and HPS (gray) with Mg2+ ion. B) KGPDC and HPS with ribulose-5-phosphate (pink). C) 2D structure of R5P.

The same holds along β-strand 3: both enzymes contain a residue that coordinates to a metal ion, and both also have residues that are involved in the acid-base chemistry. The pairs of residues that align in the consensus signatures are at positions 2, 3, 10-12, 16, and 18. With bound xylulose-5-phosphate (Figure 4.30B), the interactions for RPE can be better understood.

Figure 4.29 Aligned active site of KGPDC and RPE. The overall architecture of the active site is very similar but the residues are different. A) Aligned KGPDC (purple) and RPE (cyan) with Zn2+ ion (gray) and Mg2+ ion (lime green). B) Aligned KGPDC and RPE with bound L-xylulose-5-phosphate (X5P).

Though the chemistry is different, the spatial arrangement and consensus signatures show the evolutionary changes that have occurred in this superfamily.

4.2.5 Scoring the SALSA Table

Using the Match to Consensus Signatures (MCS) method described in Chapter 3, the SALSA table for the RPBB superfamily (Table 4.12) was scored between the subclasses as a measure of how similar the residues are. The SALSA table illustrates the advantage of POOL predictions because it gives more information about the important residues for each subclass. The CSA provides only a few residues, which makes differentiating the subclasses very difficult. For example, the PRAI subclass has only two CSA-identified catalytic residues, but POOL provides five residues important for catalysis. Between OMPDC and KGPDC, the CSA provides two more residues, but POOL identifies more. POOL recovers the known active site residues well: of the 93 literature-cited amino acids in the CSA, 90 were predicted by POOL, and only three fell below the cut-off value implemented.
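The recovery of CSA residues by POOL can be expressed as a fraction. The short Python calculation below uses only the two counts given in the text; calling the ratio "recall" is a labeling choice made here, not a term used in this chapter.

```python
# Counts quoted in the text: CSA literature-cited residues and the
# subset of them that POOL predicted above its cut-off.
csa_residues = 93
predicted_by_pool = 90

missed = csa_residues - predicted_by_pool
recall = predicted_by_pool / csa_residues  # fraction of CSA residues recovered

print(f"missed = {missed}, recall = {recall:.1%}")
```

This ratio describes how many known catalytic residues are recovered. It does not describe how many of the additional POOL predictions are functionally important, which the CSA entries alone cannot establish.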

The values from the MCS scoring method can be found in Table 4.13. Proteins grouped within a subclass have positive values, whereas negative values are generally observed between subclasses. The green regions indicate a better match, with a score above 0.40. A bad match is colored red, which is seen in most of the regions that are not within the same subclass. The exception is the region between KGPDC and HPS. This is due to the similarity of their active site residues, with only two residues that differ between the two subclasses. HPS does not contain the arginine on the 8th β-strand that is seen in the KGPDC subclass. There is also promiscuity between the enzymes: KGPDC can catalyze the HPS reaction at low rates. This is the only part of the table where significant positive values are observed across subclasses. The values in the table can be compared to those of the structural genomics proteins.
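The reading of Table 4.13 described above can be sketched in Python as a threshold test on pairwise scores. The pairs and values below are copied from Table 4.13, and the 0.40 threshold is the one quoted in the text. The MCS formula itself is defined in Chapter 3 and is not reproduced here. The names `subclass_of`, `pairs` and `cross_subclass_matches` are illustrative.

```python
GOOD_MATCH = 0.40  # threshold for a good match, as quoted in the text

# Subclass of each structure; "HM" is the HPS homology model column.
subclass_of = {
    "1rpx": "RPE", "2fli": "RPE",
    "1dbt": "OMPDC", "1l2u": "OMPDC",
    "1xbv": "KGPDC", "3exr": "KGPDC",
    "3ajx": "HPS", "HM": "HPS",
}

# A small selection of pairwise MCS values copied from Table 4.13.
pairs = [
    ("1rpx", "2fli", 0.95),
    ("1dbt", "1l2u", 0.912),
    ("1xbv", "3exr", 0.714),
    ("3ajx", "HM", 0.843),
    ("1xbv", "3ajx", 0.471),
    ("3exr", "HM", 0.488),
    ("1dbt", "1xbv", 0.061),
    ("1rpx", "1dbt", -0.25),
]


def cross_subclass_matches(pairs, subclass_of, threshold=GOOD_MATCH):
    """Return pairs from different subclasses that score above threshold."""
    return [(a, b, score) for a, b, score in pairs
            if subclass_of[a] != subclass_of[b] and score > threshold]


flagged = cross_subclass_matches(pairs, subclass_of)
for a, b, score in flagged:
    print(f"{a} ({subclass_of[a]}) vs {b} ({subclass_of[b]}): {score}")
```

For this selection, only the KGPDC and HPS pairs are flagged, in line with the observation that this is the one region of the table with significant positive values across subclasses.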

Table 4.12 The consensus signatures (in blue) for each subclass compared to CSA reported catalytic residues (in red). The consensus signatures provide more information for residues that are important.

Position of aligned residues (β1 strand to β8 strand, positions 1 to 24). Each row gives the subclass, the PDB ID and the aligned residues in order.

PRIA 1pii: C260 G261 G280 i282 v284 s287 R289 v308 v310 f311 R312 d315 H334 Q332 a358 s360 v377 D379 f389 A405 G406 S428 a429
PRIA 1lbm: C7 G8 G27 v29 y31 s34 R36 v57 v59 f60 v61 e64 H83 Q81 a103 g105 l124 D126 f139 S157 G158 s181 g182
IGPS 1pii: E53 K55 S85 l87 d89 y92 l112 K114 D115 F116 i118 m137 s139 E163 s165 G182 N184 R186 E214 S215 g236 s237
IGPS 1i4n: E47 K49 s79 L81 e83 y86 l106 K108 D109 F110 i112 i131 r133 E157 h159 g177 N179 R181 E209 S210 G231 T232
IGPS 2c3z: E51 K53 S81 l83 e85 y88 l108 K110 D111 F112 v114 I133 k135 E159 n161 G178 N180 R182 E210 S211 g233 s234
TrpA 1geq: Y10 t12 E36 G38 P44 D47 Q52 s54 v84 M86 t87 Y88 y93 v115 D116 l139 a141 Y161 v163 l165 G197 F198 G199 G220
TrpA 1qop: F22 t24 E49 g51 P57 D60 Q65 a67 G98 l100 m101 Y102 f107 a129 D130 i153 p155 Y175 l177 r179 g211 F212 G213 g234
TrpA 1xc4: F22 t24 E49 G51 p57 D60 q65 a67 G98 L100 M101 y102 f107 a129 D130 i153 p155 Y175 l177 r179 g211 F212 g213 S233
TrpA 1rd5: Y23 t25 E50 G52 P58 D61 Q66 s58 v98 l100 s101 Y102 m107 p125 D126 149 t151 Y171 v173 v175 G207 F208 G209 G230
HisA/HisF 1qo2: a6 D8 H48 v50 D51 l52 s53 q77 g79 g80 g81 r98 s102 s103 s125 D127 v162 T164 D169 A194 g195 v222 g223
HisA/HisF 1vzw: a9 D11 H50 v52 D53 l54 d55 E79 s81 g82 g83 R100 g104 t105 g128 D130 v164 T166 D171 S196 g197 g222 k223
HisA/HisF 2y85: A9 D11 H50 V52 D53 L54 D55 E79 S81 G82 g83 R100 g104 t105 G128 D130 V168 T170 D175 S200 g201 g226 k227
HisA/HisF 1thf: C9 D11 v48 l50 D51 i52 t53 t78 g80 g81 g82 K99 N103 t104 a128 D130 l169 t171 D176 S201 g202 a224 s225
HisA/HisF 1h5y: C10 D12 a51 l53 D54 i55 t56 l81 g83 g84 g85 K102 n106 t107 a131 D133 l172 t174 D179 S204 g205 A227 s228
HisA/HisF 1ox6: C243 D245 t295 l297 n298 i299 t300 t328 g330 g331 g332 K360 g364 t365 S402 D404 L467 N469 D474 S499 S500 A523 g524
RPE 1rpx: S16 l18 H41 D43 M45 p51 i53 D72 H74 l75 M76 d81 H98 E100 v124 l125 l145 M147 v149 D185 g186 g207 s208
RPE 2fli: S9 l11 H34 D36 M38 p44 i46 D65 H67 l68 M69 e74 H91 E93 v115 i116 l136 M138 v140 D176 g177 g198 s199
RPE 1h1y: S11 l13 H36 D38 M40 p46 l48 D67 H69 l70 M71 s76 H93 E95 s118 l119 l142 m144 v146 D178 g179 g200 s201
RPE 1tqj: S10 l12 H35 D37 M39 p45 I47 D66 H68 l69 M70 e75 H92 E94 v118 l119 l139 M141 v143 D179 g180 g201 s202
RPE 3ovp: S10 l12 H35 D37 m39 p45 I47 D68 H70 m71 m72 e77 H94 E96 a118 i119 l139 m141 v143 D175 g176 G197 s198
OMPDC 1dbt: a9 D11 K33 g35 M36 F58 D60 l61 K62 D65 H88 a90 v119 q121 V160 s162 P182 g183 g214 R215
OMPDC 1dv7: A18 D20 K42 g44 y45 i68 D70 f71 K72 D75 H98 f100 l123 e125 v155 p157 P180 g181 g202 R203
OMPDC 1dqw: s35 D37 K59 H61 v62 F89 D91 r92 K93 D96 H122 v124 l150 e152 i183 q185 P202 G203 G234 R235
OMPDC 1l2u: a20 D22 K44 g46 k47 F69 D71 l72 K73 D76 H99 s101 v127 v129 v167 s169 P189 G190 g221 R222
OMPDC 2za1: G21 D23 K102 N104 f105 I134 D136 m137 K138 D141 n165 Y167 l191 k193 V240 g242 P264 G265 g293 R294
OMPDC 3qw3: G19 D21 K49 n51 a52 v80 D82 a83 K84 d87 s111 y113 l133 K135 v175 g177 P199 G200 s228 R229
OMPDC 3l0k: S33 D35 K57 H59 v60 F86 D88 r89 K90 d93 H119 v121 i144 e146 i177 g179 P193 g194 G226 R227
KGPDC 1xbv: A9 D11 E33 G35 T36 I37 l38 C39 l60 D62 a63 K64 D67 I87 C88 E112 t114 H136 s138 R139 T169 G170 G191 R192
KGPDC 3exr: A11 D13 E35 G37 t38 t39 c40 l41 v62 D64 t65 K66 d69 I89 c90 E117 Y119 H141 s143 R144 T174 G175 G196 R197
HPS 3ajx: A1006 D1008 E1030 G1032 T1033 T1034 l1035 i1036 F1057 D1059 m1060 K1061 D1064 L1084 g1085 D1109 I1111 H1134 g1136 l1137 A1164 g1165 G186 g1187
HPS P42405: A6 D8 E30 G32 T33 T34 v35 v36 l57 D59 l60 K61 d64 l84 g85 D109 i111 H134 g136 y137 a165 g166 G187 G188

Table 4.13 The MCS for the RPBB SALSA table. The red represents a bad MCS score and green is a good MCS. Within the subgroups, green should be observed. This is the case for the RPBB superfamily. The KGPDC and HPS groups have similar active sites and high MCS scores relative to each other. The darker the green, the better match. The darker the red, the worse the match.

Column order (subclass: PDB IDs): PRIA: 1pii, 1lbm; IGPS: 1pii, 1i4n, 2c3z; TrpA: 1geq, 1qop, 1xc4, 1rd5; HisA/HisF: 1qo2, 1vzw, 1thf, 1h5y, 1ox6; RPE: 1rpx, 2fli, 1h1y, 1tqj, 3ovp; OMPDC: 1dbt, 1dv7, 1dqw, 1l2u, 2za1, 3qw3, 3l0k; KGPDC: 1xbv, 3exr; HPS: 3ajx, HM.

PRIA 1pii: 1 0.754 -0.1 -0.13 -0.12 -0.15 -0.22 -0.22 -0.17 -0.09 -0.13 0.009 0.044 -0.07 -0.04 -0.08 -0.08 -0.08 -0.04 -0.14 -0.23 -0.2 -0.24 -0.24 -0.21 -0.18 -0.23 -0.18 -0.14 -0.14
PRIA 1lbm: 0.754 1 0.036 -0.01 0.018 -0.17 -0.22 -0.22 -0.21 -0.13 -0.12 0.051 0.085 -0.02 -0.05 -0.03 -0.05 -0.03 0.009 -0.21 -0.27 -0.14 -0.23 -0.29 -0.27 -0.13 -0.23 -0.17 -0.14 -0.12
IGPS 1pii: -0.1 0.036 1 0.809 0.855 -0.27 -0.25 -0.25 -0.25 0.027 -0.03 0 0.018 -0.05 -0.16 -0.15 -0.06 -0.15 -0.12 -0.26 -0.05 -0.02 -0.01 -0.05 -0.06 -0.03 -0.23 -0.28 -0.23 -0.25
IGPS 1i4n: -0.13 -0.01 0.809 1 0.868 -0.27 -0.25 -0.25 -0.25 -0.1 -0.13 -0.1 -0.08 -0.14 -0.16 -0.16 -0.11 -0.16 -0.15 -0.26 -0.06 -0.03 -0.05 -0.05 -0.05 -0.03 -0.29 -0.34 -0.29 -0.31
IGPS 2c3z: -0.12 0.018 0.855 0.868 1 -0.25 -0.24 -0.24 -0.23 -0.08 -0.09 -0.08 -0.06 -0.11 -0.15 -0.15 -0.09 -0.15 -0.14 -0.25 -0.05 0 -0.02 0 0 0.018 -0.26 -0.32 -0.25 -0.28
TrpA 1geq: -0.15 -0.17 -0.27 -0.27 -0.25 1 0.602 0.602 0.763 -0.25 -0.28 -0.3 -0.31 -0.34 -0.11 -0.12 -0.07 -0.1 -0.09 -0.26 -0.13 -0.15 -0.09 -0.16 -0.14 -0.16 -0.14 -0.09 -0.11 -0.14
TrpA 1qop: -0.22 -0.22 -0.25 -0.25 -0.24 0.602 1 1 0.562 -0.19 -0.23 -0.26 -0.26 -0.29 -0.11 -0.08 -0.07 -0.1 -0.09 -0.13 -0.16 -0.19 -0.14 -0.2 -0.18 -0.2 -0.17 -0.17 -0.16 -0.18
TrpA 1xc4: -0.22 -0.22 -0.25 -0.25 -0.24 0.602 1 1 0.562 -0.19 -0.23 -0.26 -0.26 -0.29 -0.11 -0.08 -0.07 -0.1 -0.09 -0.13 -0.16 -0.19 -0.14 -0.2 -0.18 -0.2 -0.17 -0.17 -0.16 -0.18
TrpA 1rd5: -0.17 -0.21 -0.25 -0.25 -0.23 0.763 0.562 0.562 1 -0.22 -0.28 -0.3 -0.31 -0.34 -0.11 -0.12 -0.07 -0.1 -0.13 -0.25 -0.2 -0.24 -0.11 -0.22 -0.2 -0.25 -0.15 -0.19 -0.18 -0.19
HisA/HisF 1qo2: -0.09 -0.13 0.027 -0.1 -0.08 -0.25 -0.19 -0.19 -0.22 1 0.748 0.541 0.541 0.423 -0.23 -0.21 -0.17 -0.22 -0.17 -0.14 -0.15 -0.16 -0.11 -0.15 -0.11 -0.17 -0.08 -0.11 -0.05 -0.06
HisA/HisF 1vzw: -0.13 -0.12 -0.03 -0.13 -0.09 -0.28 -0.23 -0.23 -0.28 0.748 1 0.525 0.517 0.45 -0.14 -0.12 -0.12 -0.13 -0.08 -0.11 -0.18 -0.21 -0.17 -0.13 -0.1 -0.15 -0.04 -0.07 0.058 0.058
HisA/HisF 1thf: 0.009 0.051 0 -0.1 -0.08 -0.3 -0.26 -0.26 -0.3 0.541 0.525 1 0.914 0.75 -0.24 -0.22 -0.22 -0.22 -0.17 -0.16 -0.19 -0.19 -0.16 -0.17 -0.14 -0.19 -0.03 -0.08 -0.03 -0.03
HisA/HisF 1h5y: 0.044 0.085 0.018 -0.08 -0.06 -0.31 -0.26 -0.26 -0.31 0.541 0.517 0.914 1 0.704 -0.26 -0.24 -0.24 -0.24 -0.19 -0.1 -0.19 -0.19 -0.17 -0.17 -0.14 -0.19 -0.02 -0.03 0.009 -0.01
HisA/HisF 1ox6: -0.07 -0.02 -0.05 -0.14 -0.11 -0.34 -0.29 -0.29 -0.34 0.423 0.45 0.75 0.704 1 -0.33 -0.3 -0.27 -0.31 -0.27 -0.13 -0.17 -0.16 -0.1 -0.16 -0.14 -0.17 0.017 -0.03 0.009 0.009
RPE 1rpx: -0.04 -0.05 -0.16 -0.16 -0.15 -0.11 -0.11 -0.11 -0.11 -0.23 -0.14 -0.24 -0.26 -0.33 1 0.95 0.882 0.966 0.899 -0.25 -0.28 -0.29 -0.29 -0.29 -0.28 -0.29 -0.24 -0.26 -0.17 -0.19
RPE 2fli: -0.08 -0.03 -0.15 -0.16 -0.15 -0.12 -0.08 -0.08 -0.12 -0.21 -0.12 -0.22 -0.24 -0.3 0.95 1 0.873 0.983 0.949 -0.25 -0.27 -0.28 -0.29 -0.28 -0.27 -0.29 -0.23 -0.25 -0.14 -0.17
RPE 1h1y: -0.08 -0.05 -0.06 -0.11 -0.09 -0.07 -0.07 -0.07 -0.07 -0.17 -0.12 -0.22 -0.24 -0.27 0.882 0.873 1 0.897 0.889 -0.29 -0.26 -0.25 -0.22 -0.24 -0.24 -0.25 -0.22 -0.26 -0.12 -0.18
RPE 1tqj: -0.08 -0.03 -0.15 -0.16 -0.15 -0.1 -0.1 -0.1 -0.1 -0.22 -0.13 -0.22 -0.24 -0.31 0.966 0.983 0.897 1 0.932 -0.25 -0.27 -0.28 -0.29 -0.28 -0.27 -0.29 -0.23 -0.25 -0.15 -0.18
RPE 3ovp: -0.04 0.009 -0.12 -0.15 -0.14 -0.09 -0.09 -0.09 -0.13 -0.17 -0.08 -0.17 -0.19 -0.27 0.899 0.949 0.889 0.932 1 -0.27 -0.25 -0.26 -0.25 -0.24 -0.24 -0.25 -0.21 -0.23 -0.11 -0.13
OMPDC 1dbt: -0.14 -0.21 -0.26 -0.26 -0.25 -0.26 -0.13 -0.13 -0.25 -0.14 -0.11 -0.16 -0.1 -0.13 -0.25 -0.25 -0.29 -0.25 -0.27 1 0.761 0.717 0.912 0.628 0.522 0.735 0.061 0.079 0.061 0.061
OMPDC 1dv7: -0.23 -0.27 -0.05 -0.06 -0.05 -0.13 -0.16 -0.16 -0.2 -0.15 -0.18 -0.19 -0.19 -0.17 -0.28 -0.27 -0.26 -0.27 -0.25 0.761 1 0.661 0.695 0.669 0.576 0.636 -0.03 -0.03 0.008 -0.01
OMPDC 1dqw: -0.2 -0.14 -0.02 -0.03 0 -0.15 -0.19 -0.19 -0.24 -0.16 -0.21 -0.19 -0.19 -0.16 -0.29 -0.28 -0.25 -0.28 -0.26 0.717 0.661 1 0.641 0.598 0.521 0.923 -0.13 -0.11 -0.09 -0.11
OMPDC 1l2u: -0.24 -0.23 -0.01 -0.05 -0.02 -0.09 -0.14 -0.14 -0.11 -0.11 -0.17 -0.16 -0.17 -0.1 -0.29 -0.29 -0.22 -0.29 -0.25 0.912 0.695 0.641 1 0.607 0.5 0.688 -0.01 -0.03 0 -0.05
OMPDC 2za1: -0.24 -0.29 -0.05 -0.05 0 -0.16 -0.2 -0.2 -0.22 -0.15 -0.13 -0.17 -0.17 -0.16 -0.29 -0.28 -0.24 -0.28 -0.24 0.628 0.669 0.598 0.607 1 0.846 0.65 -0.11 -0.11 -0.01 -0.03
OMPDC 3qw3: -0.21 -0.27 -0.06 -0.05 0 -0.14 -0.18 -0.18 -0.2 -0.11 -0.1 -0.14 -0.14 -0.14 -0.28 -0.27 -0.24 -0.27 -0.24 0.522 0.576 0.521 0.5 0.846 1 0.598 -0.13 -0.1 -0.03 -0.03
OMPDC 3l0k: -0.18 -0.13 -0.03 -0.03 0.018 -0.16 -0.2 -0.2 -0.25 -0.17 -0.15 -0.19 -0.19 -0.17 -0.29 -0.29 -0.25 -0.29 -0.25 0.735 0.636 0.923 0.688 0.65 0.598 1 -0.14 -0.12 -0.03 -0.05
KGPDC 1xbv: -0.23 -0.23 -0.23 -0.29 -0.26 -0.14 -0.17 -0.17 -0.15 -0.08 -0.04 -0.03 -0.02 0.017 -0.24 -0.23 -0.22 -0.23 -0.21 0.061 -0.03 -0.13 -0.01 -0.11 -0.13 -0.14 1 0.714 0.471 0.479
KGPDC 3exr: -0.18 -0.17 -0.28 -0.34 -0.32 -0.09 -0.17 -0.17 -0.19 -0.11 -0.07 -0.08 -0.03 -0.03 -0.26 -0.25 -0.26 -0.25 -0.23 0.079 -0.03 -0.11 -0.03 -0.11 -0.1 -0.12 0.714 1 0.48 0.488
HPS 3ajx: -0.14 -0.14 -0.23 -0.29 -0.25 -0.11 -0.16 -0.16 -0.18 -0.05 0.058 -0.03 0.009 0.009 -0.17 -0.14 -0.12 -0.15 -0.11 0.061 0.008 -0.09 0 -0.01 -0.03 -0.03 0.471 0.48 1 0.843
HPS HM: -0.14 -0.12 -0.25 -0.31 -0.28 -0.14 -0.18 -0.18 -0.19 -0.06 0.058 -0.03 -0.01 0.009 -0.19 -0.17 -0.18 -0.18 -0.13 0.061 -0.01 -0.11 -0.05 -0.03 -0.03 -0.05 0.479 0.488 0.843 1

4.2.6 Annotating structural genomics proteins using SALSA

The list of structural genomics proteins (Table 4.7) was generated by searching the PDB; entries were selected as “structural genomics” if they were solved at a structural genomics center. The majority of the solved crystal structures contain one monomer. POOL calculations were performed on the structural genomics proteins to identify their active site residues. Homology models were made for those proteins with missing regions. The proteins were aligned using the multiple structure alignment program T-COFFEE. A SALSA table was generated for the SG proteins (Table 4.14) and scored to identify matching active sites. Aligned residues are not shown in Table 4.14 because the focus is on the consensus signatures. The MCS results from the SALSA analysis (Table 4.15) cover 26 structural genomics proteins that are putative members of this superfamily. Of the 26 proteins, 23 have correct putative annotations according to the SALSA method. Three of the proteins scored lower according to the MCS or have active site residues that are problematic. Sequence identity does not point to a specific biochemical function for the structures with PDB codes 3qja, 4fb7, 2cze, 2ffc, 3r89 and 1vqt. It is not known how the function was assigned to these structural genomics proteins because the information was not provided on the PDB pages.

If the function had been assigned by sequence alone, PDB 3r89 would be hard to distinguish from KGPDC or HPS, although the proper assignment is to the OMPDC subgroup. The structural genomics proteins are discussed in the sections below.

Table 4.14 Structurally aligned consensus signatures (in blue) of structural genomics proteins to subclass representatives. Residues in BOLD are POOL predicted, residues in green are POOL predicted but are a different amino acid, and residues in lowercase are not POOL predicted.

Spatially aligned residues β1 strand β2 strand β3 strand β4 strand β5 strand β6 strand β7 strand β8 strand PDB ID 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 CS IGPS 2c3z E51 K53 K110 D111 F112 E159 N180 R182 E210 S211 g233 s234 1vc4 E51 K53 K112 D113 F114 E160 N181 R183 E214 S215 g236 3qja E57 K59 K119 D120 F121 E168 N189 R191 E219 S220 g242 3tc6 E51 K53 S110 D111 a112 v159 E180 R182 I210 a211 g233 SG 3tc7 E51 K53 Y110 D111 f112 I159 A180 R182 K210 E211 g233 4fb7 E57 K59 K119 D120 F121 E168 N189 R191 E219 S220 G242 3tsm E60 K62 K122 D123 F124 E171 N192 R194 E222 S223 G245 CS TrpA 1qop F22 E49 P57 D60 Q65 Y102 D130 Y175 F212 G213 1ujp Y21 E47 P55 D58 q63 Y99 D127 Y172 F209 G210 2e09 Y10 E36 p44 D47 q52 Y88 D116 Y161 f198 g199 SG 1wq5 F22 E49 p57 D60 q65 Y102 D130 Y175 f212 g213 3tha Y16 E43 p51 d54 a59 Y93 E121 Y166 f203 g204 3nav F22 E49 p57 D60 q65 Y102 D130 Y175 f212 g213 1vzw D11 H50 R100 D130 T166 D171 S196 g197 g222 CS His A/F 1h5y D12 a51 K102 D133 t174 D179 S204 g205 A227 2a0n D11 V48 K99 D130 t171 D176 s201 g202 a223 1vh7 D11 V48 K99 D130 t171 D179 s201 g202 a223 SG 4gj1 D9 H49 r100 D131 t171 d176 S202 g203 v223 2agk D9 H57 K100 D134 H178 d181 a212 g213 f236 1ka9 D12 v49 K100 D131 t172 D177 s202 g203 a224 CS RPE 1rpx S16 H41 D43 M45 D72 H74 M76 H98 E100 M147 D185 g207 1tqx S11 H36 D38 M40 D68 H70 M72 H92 E94 M145 D179 g201 3cu2 G18 H43 D45 A47 D72 H74 M76 Q98 E100 L154 D192 G216 SG 3inp S9 H34 D36 m38 D66 H68 M70 H92 E94 M139 D177 g199 3qc3 S10 H35 D37 m39 D68 H70 M71 H94 E96 m141 D175 g197 CS OMPDC 1dbt D11 K33 D60 K62 D65 H88 P182 g214 R215 2AQW D23 K105 D139 K141 D144 n168 P267 g296 R297 3LDV D11 K33 D60 K62 D65 H88 P178 g210 R211 1VQT D8 K26 D52 K54 D57 H80 P150 g180 R181 SG 2FFC D41 K133 D167 K169 D172 N196 P295 g324 r325 2CZE D7 K29 D57 K59 d62 H85 P162 G185 R186 3R89 D22 K61 D95 K97 d100 n125 P222 S249 R250 3V75 D25 K62 D96 K98 d101 S128 P221 S249 R250 3ajx D1008 E1030 D1059 K1061 D1064 D1109 H1134 A1164 G186 CS HPS P42405 D8 E30 
D59 K61 d64 D109 H134 G187 SG 3f4w D8 E30 D59 K61 d64 D109 H134 a165 G187

Table 4.15 The MCS and sequence identity scores for the structural genomics proteins. The SALSA MCS is compared with sequence identity to demonstrate that the method provides more information. The scores in green are the best match and those in red are the worst match.

Putative Match to Consensus Signature Score Overall Sequence Identity Query Protein Annotation Subclass "Representative" Subclass "Representative" of Query Species PDB IGPS TrpA PRIA HisA HisF RPE OMPDC KGPDC HPS IGPS TrpA PRIA HisA/HisF RPE OMPDC KGPDC HPS 1 Thermus thermophilus 1vc4 IGPS 0.47 - 0.282 - 0.075 - 0.151 - 0.189 - 0.215 - 0.311 - 0.295 - 0.286 35 6 8 11 11 10 5 12 2 Mycobacterium tuberculosis 3qja IGPS 0.44 - 0.273 - 0.047 - 0.123 - 0.217 - 0.206 - 0.302 - 0.286 - 0.276 27 13 4 13 15 8 11 19 Brucella melitensis biovar IGPS 3 Abortus 2308 3tsm 0.39 - 0.273 - 0.084 - 0.113 - 0.142 - 0.224 - 0.292 - 0.295 - 0.276 31 10 2 2 6 6 12 14 Mycobacterium tuberculosis IGPS 4 H37Rv 4fb7 0.44 - 0.273 - 0.047 - 0.123 - 0.217 - 0.206 - 0.302 - 0.286 - 0.276 27 13 4 13 15 9 11 19 5 Thermus thermophilus hb8 1ujp TrpA - 0.32 0.509 - 0.29 - 0.358 - 0.396 - 0.178 - 0.255 - 0.152 - 0.095 9 37 7 9 5 11 4 8 6 Pyrococcus furiosus 2e09 TrpA - 0.31 0.573 - 0.224 - 0.33 - 0.368 - 0.15 - 0.274 - 0.045 - 0.095 7 99 5 9 6 12 3 8 7 Escherichia coli 1wq5 TrpA - 0.29 0.964 - 0.28 - 0.349 - 0.33 - 0.15 - 0.132 - 0.143 - 0.152 13 33 6 10 12 13 3 4 10 Campylobacter jejuni subsp. 3tha TrpA - 0.23 0.473 - 0.15 - 0.377 - 0.368 - 0.121 - 0.292 - 0.125 - 0.2 11 25 5 2 10 12 3 8 Vibrio cholerae O1 biovar El Tor TrpA 9 str. N16961 3nav - 0.28 1 - 0.252 - 0.349 - 0.321 - 0.121 - 0.132 - 0.143 - 0.152 6 33 3 9 11 4 3 14 10 Thermotoga maritima 2a0n HisA/HisF - 0.19 - 0.309 - 0.009 0.481 1 - 0.318 - 0.208 - 0.062 - 0.029 9 10 4 21 6 5 2 13 11 Thermotoga maritima 1vh7 HisA/HisF - 0.19 - 0.309 - 0.009 0.481 1 - 0.318 - 0.208 - 0.062 - 0.029 9 10 4 21 6 5 2 13 12 Campylobacter jejuni subsp. 
4gj1 HisA/HisF - 0.16 - 0.3 - 0.103 0.745 0.623 - 0.206 - 0.151 - 0.062 - 0.038 11 8 2 31 12 10 4 9 13 Saccharomyces cerevisiae 2agk HisA/HisF - 0.19 - 0.173 - 0.131 0.217 0.226 - 0.243 - 0.179 - 0.134 - 0.238 7 9 8 12 2 4 9 1 14 Thermus thermophilus 1ka9 HisA/HisF - 0.18 - 0.291 0 0.453 0.925 - 0.318 - 0.189 - 0.071 - 0.048 5 10 6 22 5 10 4 14 Francisella tularensis subsp. RPE 15 Tularensis 3inp - 0.25 - 0.1 - 0.065 - 0.245 - 0.283 0.944 - 0.33 - 0.304 - 0.2 13 6 13 8 40 5 7 10 16 Haemophilus somnus 3cu2 RPE - 0.2 - 0.073 - 0.271 - 0.292 - 0.415 0.467 - 0.358 - 0.223 - 0.314 6 4 4 2 20 9 9 5 17 Plasmodium falciparum 3D7 1tqx RPE - 0.23 - 0.055 - 0.065 - 0.217 - 0.283 0.869 - 0.34 - 0.295 - 0.171 9 11 5 9 33 11 7 11 18 Homo sapiens 3qc3 RPE - 0.22 - 0.182 - 0.308 - 0.132 - 0.236 0.523 - 0.255 - 0.25 - 0.267 13 16 1 11 36 11 13 13 19 Plasmodium yoelii 2aqw OMPDC - 0.29 - 0.145 - 0.206 - 0.123 - 0.217 - 0.327 0.557 - 0.045 - 0.105 4 12 11 4 3 14 8 4 20 Vibrio cholera 3ldv OMPDC - 0.2 - 0.1 - 0.28 - 0.075 - 0.16 - 0.336 0.84 0.089 0.029 6 3 3 8 5 40 13 15 21 Thermotoga maritime 1vqt OMPDC - 0.25 - 0.136 - 0.299 - 0.123 - 0.189 - 0.336 0.726 0.027 - 0.029 8 4 5 6 5 18 13 13 22 Plasmodium vivax 2ffc OMPDC - 0.31 - 0.145 - 0.196 - 0.132 - 0.189 - 0.336 0.557 - 0.045 - 0.076 4 7 7 6 10 12 9 11 23 Pyrococcus horikoshii OT4 2cze OMPDC - 0.27 - 0.164 - 0.28 - 0.132 - 0.208 - 0.308 0.689 0 - 0.076 6 16 7 7 13 26 20 19 Anaerococcus prevotii DSM OMPDC 6 3 8 2 4 3 6 8 24 20548 3r89 - 0.35 - 0.127 - 0.206 - 0.104 - 0.16 - 0.336 0.481 - 0.027 - 0.152 25 Streptomyces avermitilis 3v75 OMPDC - 0.32 - 0.064 - 0.271 - 0.075 - 0.142 - 0.355 0.387 0.036 - 0.095 7 6 4 7 3 7 7 13 26 Salmonella typhimurium 3f4w KGPDC - 0.17 - 0.164 - 0.224 0.019 - 0.132 - 0.318 0.217 0.188 0.257 11 12 3 8 12 16 26 38

Four SG proteins for IGPS subgroup

Four structural genomics proteins were categorized with “indole glycerol phosphate synthase” function. Using the SALSA alignment table and scoring, the proteins 1vc4, 3qja, 4fb7 and 3tsm were matched to the IGPS subclass (Table 4.15). The four putative IGPS SG proteins have high MCS scores for the IGPS subclass. Two of the SG proteins scored similar values but are missing active site residues. A closer look at the consensus signature alignment indicates that the four SG proteins have the correct annotation (Figure 4.30). The four putative SG proteins, PDB 1vc4, 3qja, 4fb7 and 3tsm, contain POOL predicted residues similar to the consensus signature. The scores reflect aligned residues, both POOL predicted and non-POOL predicted, in the spatial consensus positions to help differentiate between subclasses.

Five SG proteins for TrpA subgroup

Five structural genomics proteins were categorized with “tryptophan synthase alpha” function. Using the SALSA alignment table and scoring, the proteins 1ujp, 2e09, 1wq5, 3tha and 3nav were matched to the TrpA subclass (Table 4.15) and were found to be correctly annotated for this function (Figure 4.31). For the five putative TrpA SG proteins, the MCS was comparable to that of the representative TrpA. A few residues aligned with similar residues but were not POOL predicted. A closer look at the consensus signature alignment suggests that four SG proteins may have the correct annotation.

Figure 4.30 The four correctly annotated SG proteins in the IGPS subclass. A) The catalytic glutamates (positions 1 & 16) and lysines (positions 2 & 10) are found in these structural genomics proteins. B) The table shows the POOL predicted residues in BOLD and uppercase, which match for all the proteins.

Five SG proteins for HisA/HisF subgroup

Five structural genomics proteins were categorized with “HisA/HisF” function. Using the SALSA alignment table and scoring, the proteins 2a0n, 1vh7, 4gj1, 2agk and 1ka9 were matched to the HisA/HisF subclass (Table 4.15) and were found to be correctly annotated for this function (Figure 4.32). The catalytically important residues are in positions 2 and 17. The SG proteins with HisA function contain a histidine (position 3) and an arginine (position 13). The HisF proteins contain no histidine in position 3 and a lysine in position 13. There are consensus signature residues that are not POOL predicted in positions 19, 20 and 21, but the spatially aligned residues match the consensus signature residues. The MCS values match the HisA/HisF groups.

The SG protein 4gj1 scores high for HisA function. Although the spatial alignment of 2agk indicates HisA, its MCS score does not match because the POOL predictions are missing for the aspartate and serine. The SG proteins 2a0n, 1vh7 and 1ka9 match HisF function. They follow a similar trend and only have issues with the non-POOL predicted aspartate and serine.

These SG proteins are found to be correctly annotated on the basis of the residues in their active sites.

Figure 4.31 The five SG proteins that matched the TrpA subclass. B) The table lists the residues found in each spatial position. The POOL predictions for proline (position 5), glutamine (position 7) and phenylalanine (position 22) are not seen in these SG proteins, which leads to lower MCS values. Further evaluation of the active site indicates that these proteins do have a consensus signature residue in those spatial positions.

Figure 4.32 The five SG proteins that matched the HisA/HisF subclass. This subclass is separated into two enzyme functions. HisA corresponds to the PDBs 4gj1 and 2agk. HisF corresponds to 2a0n, 1vh7 and 1ka9. Positions 3 and 13 show variability in residue type. The histidine is conserved and POOL predicted in HisA but not in HisF. In position 13, an arginine is found in HisA and a lysine in HisF. These two amino acids are similar in that they are both basic, and both are POOL predicted for the SG proteins, except 2agk.

Figure 4.33 The four SG proteins that matched the RPE subclass. There are minor differences between the key residues of these SG proteins and the consensus signature residues. In position 4, the conserved methionine is not predicted, but a methionine is found in the spatial position for two of the SG proteins and a POOL predicted alanine is found in the spatial position for one SG protein. In position 19, one SG protein has a conserved methionine that is not POOL predicted and another has a leucine instead. In position 14, a glutamine is found instead of the histidine for one SG protein. The histidine hydrogen bonds to the ligand; the glutamine could perform the same function.

Four SG proteins for RPE subgroup

Four structural genomics proteins were categorized with “ribulose phosphate epimerase” function. Using the SALSA alignment table and scoring, the proteins 1tqx, 3cu2, 3inp and 3qc3 were matched to the RPE subclass (Table 4.15) and were found to be correctly annotated for this function (Figure 4.33). Only a few of the amino acids aligned with the consensus signatures differed. The high MCS values indicate that these SG proteins are correctly annotated.

Seven SG proteins for OMPDC subgroup

Seven structural genomics proteins were categorized with “orotidine monophosphate decarboxylase” function. Using the SALSA alignment table and scoring, the proteins 2aqw, 3ldv, 1vqt, 2ffc, 2cze, 3r89 and 3v75 were matched to the OMPDC subclass (Table 4.15) and were found to be correctly annotated for this function (Figure 4.34). In position 13 the aspartate is not POOL predicted, but an aspartate is present in the spatially aligned position. In position 14 the amino acid varies, but POOL does predict those spatially aligned residues. The high MCS values indicate that these SG proteins are correctly annotated.

One SG protein for HPS subgroup

One structural genomics protein was categorized with “hexulose phosphate synthase” function. Using the SALSA alignment table and scoring, the protein 3f4w was matched to the HPS subclass (Table 4.15) and was found to be correctly annotated for this function (Figure 4.35). The MCS score was ambiguous among OMPDC, KGPDC and HPS. Upon further evaluation of the active site and spatially aligned residues, the SG protein is correctly annotated as a hexulose phosphate synthase.
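The ambiguity noted for 3f4w can be made concrete with a short sketch. The MCS values below are transcribed from the 3f4w row of Table 4.15 (the flattened table is read here with "- 0.17" meaning -0.17, which is an assumption about the extraction). The function name `rank_subclasses` and the `margin` cutoff of 0.1 are hypothetical and chosen only for illustration; they are not part of the SALSA method.

```python
# MCS values for SG protein 3f4w, transcribed from Table 4.15.
mcs_3f4w = {
    "IGPS": -0.17, "TrpA": -0.164, "PRIA": -0.224, "HisA": 0.019,
    "HisF": -0.132, "RPE": -0.318, "OMPDC": 0.217, "KGPDC": 0.188,
    "HPS": 0.257,
}


def rank_subclasses(scores, margin=0.1):
    """Return (best subclass, runner-up, score gap, is_ambiguous).

    `margin` is a hypothetical cutoff used only in this illustration:
    a best score that leads the runner-up by less than `margin` is
    flagged as ambiguous and left for manual active-site inspection.
    """
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    (best, best_score), (second, second_score) = ordered[0], ordered[1]
    gap = best_score - second_score
    return best, second, gap, gap < margin


best, second, gap, ambiguous = rank_subclasses(mcs_3f4w)
print(best, second, round(gap, 3), ambiguous)
```

With these values the top three scores (HPS, OMPDC, KGPDC) lie within 0.07 of one another, which is why the assignment was settled by inspecting the spatially aligned residues rather than by the score alone.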

Figure 4.34 The seven SG proteins that matched the OMPDC subclass. The D-x-x-K-D motif is found in the SG proteins. The reactive arginine (position 24) is found in all SG proteins. In position 13, an aspartate is not POOL predicted for three of the proteins but an aspartate is aligned in that spatial region for all the SG proteins. In position 14, the consensus signature identifies a conserved histidine. In four of the SG proteins, there is an asparagine or serine in its place.

Figure 4.35 The one SG protein that matched the KGPDC subclass. The SG protein was missing a POOL predicted aspartate in position 13 but did contain an aspartate in that spatial region. All other consensus signature residues align very well for this SG protein.

Of the 26 SG proteins, 23 are correctly annotated according to the SALSA method. The SG proteins with lower MCS scores are missing some POOL predicted residues, but inspection of the active site shows a similar residue in each such position. Biochemical tests can be performed on these SG proteins to confirm their functions.

4.3 Preliminary results in the analysis of RPBB superfamily using SALSA-DT

Because manual sorting of the RPBB superfamily is time-consuming, mainly owing to the tedious task of multiple structure alignment, an automated method is preferred. A collaboration with the mathematics department and the College of Computer and Information Science at NU led to the development of the SALSA-DT method, which implements a graph theory Delaunay triangulation approach to identify local active site matches, incorporating the predicted active site residues. The automated process is described in detail in Chapter 3. The rank order files from POOL and the PDB coordinate files were run through the algorithm described there. The preliminary results presented here were consistent with the manually generated SALSA table; the analysis of the structural genomics proteins is also presented in this section.

4.3.1 Results from sorting the subclasses

This section describes the subgroup output file from the SALSA-DT algorithm. The heuristics referred to in this section are listed in Chapter 3.3.7. The output file contains groupings of similar proteins and matches between similar groups of proteins; groups that match one another (intergroups) are separated by a three-dash marking “---”. POOL predicted residues are marked with an asterisk (*). The POOL predicted, spatially matched residues appear in the same column. These columns will be evaluated to identify the alignments correctly. Sets of groups that do not match with other groups (intragroups) form their own grouping, separated by a six-dash marking “------”.
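The separator convention described above can be illustrated with a minimal parsing sketch. It assumes that each entry has the form `group:ID` (for example `1:1DBT`), that “---” separates matched groups and that “------” separates the larger groupings. The function name `parse_salsa_dt_groups` and the abbreviated sample string are hypothetical; the real output file also carries residue columns that are not handled here.

```python
def parse_salsa_dt_groups(text):
    """Split SALSA-DT style group output into supergroups of groups.

    Tokens look like '1:1DBT' (group number, then protein ID).
    '------' separates supergroups; '---' separates groups inside one.
    Returns a list of supergroups, each a list of groups, each a list
    of (group_number, protein_id) tuples.
    """
    supergroups = []
    # Split on the six-dash marker first so it is not mistaken for '---'.
    for super_block in text.split("------"):
        groups = []
        for block in super_block.split("---"):
            members = []
            for token in block.split():
                group, protein = token.split(":", 1)
                members.append((int(group), protein))
            if members:
                groups.append(members)
        if groups:
            supergroups.append(groups)
    return supergroups


# Abbreviated, illustrative excerpt in the same layout as the output file.
sample = (
    "1:1DBT 1:1DV7 --- 3:1QO2 3:1VZW --- 3:1THF --- 7:1XBV --- 8:3AJX "
    "------2:1GEQ 2:1QOP --- 4:1RPX "
    "------5:1PII 5:1I4N --- 6:1LBM"
)
result = parse_salsa_dt_groups(sample)
for i, groups in enumerate(result, start=1):
    print(i, sorted({member[0] for group in groups for member in group}))
```

Splitting on the longer marker before the shorter one keeps the two levels of grouping distinct even when a marker is fused to the next token, as happens in the extracted listing.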

The RPBB superfamily was sorted automatically by SALSA-DT. Thirty-one characterized protein structures and their POOL rank files were submitted to the program. The groups for these 31 proteins were sorted as:

SALSA-DT assignment    Functional assignment
1:1DBT                 OMPDC
1:1DV7                 OMPDC
1:1DQW                 OMPDC
1:1L2U                 OMPDC
1:2ZA1                 OMPDC
1:3QW3                 OMPDC
1:3L0K                 OMPDC
---
3:1QO2                 HisA
3:1VZW                 HisA
3:2Y85                 HisA
---
3:1THF                 HisF
3:1H5Y                 HisF
3:1OX6                 HisF
---
7:1XBV                 KGPDC
7:3EXR                 KGPDC
---
8:3AJX                 HPS
8:P42405               HPS
------
2:1GEQ                 TrpA
2:1QOP                 TrpA
2:1RD5                 TrpA
2:1XC4                 TrpA
---
4:1RPX                 RPE
4:2FLI                 RPE
4:1H1Y                 RPE
4:1TQJ                 RPE
4:3OVP                 RPE
------
5:1PII                 IGPS
5:1I4N                 IGPS
5:2C3Z                 IGPS
---
6:1PII                 PRIA
6:1LBM                 PRIA

In the list above, the integer in front of each PDB code indicates the number of the group, assigned by the automated method, of which the protein is a member. In the output, only the proteins within a group are spatially aligned, based on the tetrahedra matching. The alignments between groups need to be further optimized. The results from sorting the superfamily are correct. The numbered groups correlate as: 1 = OMPDC, 2 = TrpA, 3 = HisA/HisF, 4 = RPE, 5 = IGPS, 6 = PRAI, 7 = KGPDC and 8 = HPS. All proteins are placed in the correct group based on the weighted input of predictions by POOL. To understand the groupings, an analysis is performed on the output alignment files (Table 4.16). In Table 4.16, one supergroup includes groups 1, 3, 7 and 8 (OMPDC, HisA/HisF, KGPDC and HPS). As discussed in the previous section, OMPDC, KGPDC and HPS are closely related, with common conserved residues that have all been POOL predicted. The HisA/HisF group contains spatially similar positions for the consensus signatures as OMPDC, but with different amino acid residues. SALSA-DT split up the HisA/HisF group because of a few residues that vary outside of the consensus signature but are also POOL predicted in the subclass. Another supergroup identified by the automated method contains Group 2 = TrpA and Group 4 = RPE. TrpA and RPE share more regions of predicted residues along the eight β-strands than other subclasses. The third supergroup includes Group 5 = IGPS and Group 6 = PRAI. PRAI and IGPS catalyze consecutive reactions and share a common active site region. TrpA contains more residues in its consensus signature, which may explain why it is not grouped with PRAI and IGPS. Sequence numbering is one problem that has arisen in this automated procedure. The output file will need to be reviewed to collect all POOL predicted residues. Because the consensus signatures were previously known, identifying the important residues was straightforward for this part. The consensus signatures are shown in Table 4.17.
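The statement that every characterized protein lands in the correct numbered group can be checked mechanically. The sketch below is illustrative only: the protein IDs, group numbers and functional labels are transcribed from the 31-entry SALSA-DT listing given earlier in this section (which spells the label PRIA), and the `expected` mapping restates the group-to-subclass correspondence given in the text. The variable names are hypothetical.

```python
from collections import defaultdict

# (group number, functional assignment) -> protein IDs, from the listing.
listing = {
    (1, "OMPDC"): ["1DBT", "1DV7", "1DQW", "1L2U", "2ZA1", "3QW3", "3L0K"],
    (3, "HisA"): ["1QO2", "1VZW", "2Y85"],
    (3, "HisF"): ["1THF", "1H5Y", "1OX6"],
    (7, "KGPDC"): ["1XBV", "3EXR"],
    (8, "HPS"): ["3AJX", "P42405"],
    (2, "TrpA"): ["1GEQ", "1QOP", "1RD5", "1XC4"],
    (4, "RPE"): ["1RPX", "2FLI", "1H1Y", "1TQJ", "3OVP"],
    (5, "IGPS"): ["1PII", "1I4N", "2C3Z"],
    (6, "PRIA"): ["1PII", "1LBM"],
}

# Group-to-subclass correspondence stated in the text (3 = HisA/HisF).
expected = {
    1: {"OMPDC"}, 2: {"TrpA"}, 3: {"HisA", "HisF"}, 4: {"RPE"},
    5: {"IGPS"}, 6: {"PRIA"}, 7: {"KGPDC"}, 8: {"HPS"},
}

# Collect the functions actually observed under each group number.
observed = defaultdict(set)
for (group, function), proteins in listing.items():
    observed[group].add(function)

total = sum(len(proteins) for proteins in listing.values())
consistent = dict(observed) == expected
print(total, consistent)
```

Note that 1PII appears under both group 5 and group 6 in the listing, so the 31 entries correspond to 30 distinct identifiers.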

Table 4.16 The raw classification of proteins from subclasses with grouped POOL alignments. The eight functional subclasses are organized in three supergroups, separated by ------. Supergroup 1 contains groups 1, 3, 7 and 8, Supergroup 2 contains groups 2 and 4, and Supergroup 3 contains groups 5 and 6.

1:1dbt -- H88* P182* D65* F58* -- K62* D60* K33* D11* -- G214 R215* 1:1dv7 -- H98* P180* D75* I68* A94 K72* D70* K42* D20* H128* G202 R203* 1:1dqw C33* H122* P202* D96* F89* -- K93* D91* K59* D37* H61* G234 R235* 1:1l2u -- H99* P189* D76* F69* -- K73* D71* K44* D22* -- G221 R222* 1:2za1 C19* -- P264* -- I134 S161 K138* D136* K102* D23* -- -- R294* 1:3qw3 C17* -- P199* -- V80 -- K84* D82* K49* D21* -- G230 R229* 1:3l0k C31* H119 P193* E146 F86* -- K90* D88* K57* D35* H59* G226* R227 --- 3:1qo2 D51* H48* -- P5* -- D8* -- D127* T164* T180 L126 D176* S125 3:1vzw D53* H50* R64 P8* -- D11* -- D130* T166* G128 R100* -- S196* 3:2y85 D53* H50* H64 P8 -- D11* -- D130* T170* G128* R100* D175* S200* --- 3:1thf -- R5* A8* C9* D11* D51* K99* D130* S144 T171 R175 D176* S201* 3:1h5y -- R6* P9* C10* D12* D54* K102* D133* G147 T174 R178 D179* S204 3:1ox6 H487* R239* A242 C243* D245 N298 K360* D404* G442 -- K473* D474* S499* --- 7:1xbv E33* P200* R195* C39* -- K64* T85* D62* D11* E112 H136 G35* R192* 7:3exr E35* P205* T200 C90* D69* K66* T87 D64* D13* E117* H141* G37* R197* --- 8:3ajx A1006* -- E1030* -- T1082* D1064* K1061* D1059* D1109* T1033* M1060 D1008* G1186* -- H1134* 8:P42405 A6* -- E30* -- L84 -- K61* D59* I81* T33* L60 D8* G187* D109 ------2:1geq -- Y10* E36* G38* P40* D43* P44* D47* I51* Q52* M86* Y88* -- D116* -- Y161* G197* F198* G220* 2:1qop Y4* F22* E49* G51 P53* D56* P57* D60* I64* Q65* L100 Y102* Y115* D130* Y173* Y175* G211 -- G234 2:1xc4 Y4* F22* E49* G51* -- -- P57 D60* I64 Q65 L100* Y102 Y115* D130* Y173* Y175* G211 -- G234* 2:1rd5 -- Y23* E50* G52* P54* D57* P58* D61* I65* Q66* L100 Y102* -- D126* -- Y171* G207* F208* G230* --- 4:1rpx S16* H41* H74* H98* M147* P151* E100* D72* M45* P15* E183* M76* D43* D185* G207 4:2fli S9* H34* H67* H91* M138* P142* E93* D65* M38* P8* E174* M69* D36* D176* G198 4:1h1y S11* H36* H69* H93* M144 P148 E95* D67* M40* P10* E176 M71* D38* D178* G200* 4:1tqj S10* H35* H68* H92* M141* P145* E94* D66* M39* P9* E177* M70* D37* D179* G201 
4:3ovp S10* H35* H70* H94* M141 P145 E96* D68* M39 P9* E173* M72 D37* D175* G197* ------5:1PII E53* K55* S58* S85* D89* F93* K114* D115* F116* L135* E163* N184* R186* E214* S215* L234* G236 5:1I4N E47* K49* S52* S79 E83 F87* K108* D109* F110 L129 E157 N179* R181* E209* S210* L229* G231* 5:2C3Z E51* K53* S56 S81* E85* F89 K110* D111* F112* L131* E159* N180* R182* E210* S211* L231* G233* --- 6:1pii K258* C260* G280* R289* H334* D379* A405 -- D425* -- 6:1lbm K5* C7* G27* R36* H83* D126* S157* G158* D178* E184*

Table 4.17 The organized and detailed analysis from SALSA-DT for IGPS, TrpA, PRIA and HisA/HisF subclasses. The subclasses are labeled and this is NOT an overall alignment. Consensus signatures are underlined *AA# POOL predicted residues ---- indicated incorrect alignments. GOLD is not aligned, not POOL predicted. RED is aligned, not POOL predicted. GREEN is not predicted, not aligned. PURPLE is incorrect. - - lines are a space with no residue aligned.

For IGPS, PDB 1i4n lacks POOL predictions for two residues, F110 and E157 (in red); the other proteins do have predictions at those positions. The consensus signatures (underlined) are located and predicted. These CS correspond to the same positions found by the manual SALSA sorting.

For TrpA, the two PDBs 1qop and 1xc4 are missing phenylalanines. The GOLD indicates that the residues are not predicted and not aligned. In the original SALSA table this position contains a POOL predicted phenylalanine. The consensus signatures (underlined) are all predicted for PDB 1geq and 1rd5; PDB 1qop and 1xc4 are missing one residue but predict all other CS positions. These correspond to positions found by the manual SALSA sorting.

For PRAI, there is a perfect match for this group. The consensus signatures (underlined) are found.

For HisA/HisF, there were a few issues with aligning the wrong residues, which are colored PURPLE. A POOL predicted H487 is aligned with H48/H50/H50. The histidine is only found in the HisA proteins, 1qo2/1vzw/2y85, so this is an incorrect alignment. The residue D171 of PDB 1vzw is POOL predicted but does not appear in this table. The residue D176 of PDB 1qo2 is incorrectly aligned; it should be D169, which is POOL predicted and found in the SALSA table. The residue S125 is not a POOL predicted residue for PDB 1qo2 and does not belong in the alignment; the POOL predicted A194 should be in this position. The T171 (PDB 1thf) and T174 (PDB 1h5y) are colored GREEN because they are not predicted and not aligned. They should be aligned with the T164/T166/T170 group. Another misalignment (italicized) occurs with the lysines K99/K102/K360; they should be matched with the arginines R98/R100/R100. R98 is not POOL predicted for 1qo2. These alignment mistakes will be taken into account in the next optimization of the chemical similarity heuristic.

Table 4.18 The organized and detailed analysis from SALSA-DT for RPE, OMPDC, KGPDC and HPS subclasses. The subclasses are labeled and this is NOT an overall alignment between the subclasses. Consensus signatures are underlined *AA# POOL predicted residues ---- indicated incorrect alignments. GOLD is not aligned, not POOL predicted. RED is aligned, not POOL predicted. GREEN is not predicted, not aligned. PURPLE is incorrect. - - lines are a space with no residue aligned.

For the RPE, the protein PDB 3ovp does not have POOL predictions for M141/M39 and M72.

For OMPDC, the protein 3l0k has H119 in RED because it aligns but is not POOL predicted. In the original SALSA table it is shown as POOL predicted. This is possibly a result of the ranking in POOL. There is an incorrect residue, E146, for PDB 3l0k. The PDBs 2za1, 3qw3 and 3l0k do have an aspartate in that position but it is not POOL predicted. For PDB 3l0k, R227 is in RED because it is not POOL predicted.

For KGPDC, the GOLD shows a missing residue for PDB 1xbv. PDB 1xbv should contain an aspartate in the alignment, but D67 is not POOL predicted, which is why it does not appear in the table.

For HPS, PDB P42405 contains an incorrectly aligned I81. D109 is supposed to align with D1109 of PDB 3ajx. This could be due to D109 not being POOL predicted. A histidine is missing for P42405. This histidine, H134, is POOL predicted but does not show up in the table. It should align with H1134 of PDB 3ajx.

The residues that are expected to be POOL predicted but are not may be explained by the rankings having been generated with an earlier version of POOL. This will be addressed in the next re-run of the superfamily. The misalignments are an issue with SALSA-DT. This could be due to the tetrahedra weighting favoring certain POOL residues.

Overall, there was one protein that contained all residues for the consensus signatures. In spite of the differences in predictions of specific residues for the other proteins, the method is very successful in grouping the previously characterized members of the superfamily correctly according to biochemical function. These results are very promising. The next phase of optimizing the method will take into account updated rank files, more stringent chemical similarity scoring and better ordered numbering for the sequences. Notably, the SALSA table took over 14 months to sort out manually, whereas SALSA-DT obtained similar results within minutes. The development of SALSA-DT took over 10 months, but the method can now be used to sort other superfamilies. This is a major development for SALSA.

4.3.2 Annotating structural genomics proteins using SALSA-DT

With the groups sorted and the consensus signatures confirmed for the SALSA-DT run, the structural genomics proteins can be analyzed. SALSA-DT analyzed 24 structural genomics proteins. The output file from the SALSA-DT algorithm categorizes the matches within the three supergroups. The SG proteins will need to be manually inspected to be categorized properly.

There were two structural genomics proteins, PDB 1y0e and 1yxy, submitted as negative controls to test whether they would be grouped into any of the subclasses. These two proteins do not belong to this superfamily, and they were not matched with any subclass. The PDBs 3ldv, 3f4w, 3v75, 2cze, 4gj1 and 3r89 were matched to Supergroup 1. The PDBs 2e09, 1wq5, 3qc3, 3inp, 3nav, 1tqx, 1ujp, 3cu2 and 3tha were matched to Supergroup 2. The PDBs 1vh7, 2a0n, 1ka9 and 2agk were matched to Supergroup 3.

For Supergroup 1, the OMPDC SG proteins included PDBs 3ldv, 3v75, 2cze and 3r89. These had correct residue matches for the subclass and were verified in SALSA-DT. The HisA/HisF SG protein, PDB 4gj1, was a correct match for the subclass and verified in SALSA-DT. The KGPDC SG protein, PDB 3f4w, was a correct match for the subclass and verified in SALSA-DT.

Table 4.19 Results for clustering the SG proteins. SALSA-DT found three supergroups. Within the supergroups, there are subclasses that match (boxed #:PDBID) with submitted SG proteins (listed as match #). SG proteins 26, 24, 31, 20, 32 and 29 matched with members of Supergroup 1 (subclasses 1, 3, 7 and 8). SG proteins 21, 14, 28, 25, 27, 10, 11, 23 and 30 matched with members of Supergroup 2 (subclasses 2 and 4). SG proteins 12, 17, 9 and 18 matched with members of Supergroup 3 (subclasses 5 and 6). The SG proteins 19, 22 and 13 did not match any of the Supergroups (1-3). The proteins 15 and 16 are not from this superfamily.

Supergroup 1
1:1DBT   3:1QO2   3:1THF   7:1XBV   8:3AJX
---
1:1DV7   3:1VZW   3:1H5Y   7:3EXR   8:P42405
1:1DQW   3:2Y85   3:1OX6
1:1L2U
1:2ZA1
1:3QW3
1:3L0K
---
match 1
26:3LDV   24:3F4W   31:3V75   20:2CZE   32:4GJ1   29:3R89
------
Supergroup 2
2:1GEQ   4:1RPX
---
2:1QOP   4:2FLI
2:1RD5   4:1H1Y
2:1kfk   4:1TQJ
2:1xc4   4:3OVP
---
match 2
21:2E09   14:1WQ5   28:3QC3   25:3INP   27:3NAV   10:1TQX   11:1ujp   23:3CU2   30:3THA
------
Supergroup 3
5:1PII   6:1PII
---
5:1I4N   6:1LBM
5:2C3Z
---
match 3
12:1VH7   17:2A0N   9:1KA9   18:2AGK
------
re-investigate
19:2AQW   22:2FFC   13:1VQT
------
does not match
15:1Y0E   16:1YXY

For Supergroup 2, the TrpA SG proteins included PDB 2e09, 1wq5, 3nav, 1ujp and 3tha. These all had correct residue matches for the subclass and were verified in SALSA-DT. The RPE SG proteins included PDB 3qc3, 3inp, 1tqx and 3cu2. These all contained correct residue matches for the subclass and were verified in SALSA-DT.

For Supergroup 3 there were interesting results. The IGPS SG proteins have not yet been analyzed; they will be analyzed in the future with the optimized method. The PDBs 1vh7, 2a0n, 1ka9 and 2agk were matched with the wrong group; however, they aligned together. This can be explained by the lack of POOL predicted residues for the consensus signatures. These will need to be further evaluated. These SG proteins should have matched with the HisA/HisF group.

From the trends identified in the SG analysis, there seems to be an issue with handling non-POOL-predicted residues. This will need to be further analyzed in the development of the method. In the previous, manually generated SALSA analysis, the SG proteins were aligned and visually inspected. If a residue matched but was not POOL predicted, the residue was still scored as a match. In the automated method, this is not the case: a matching residue that is not POOL predicted is not scored, so the POOL predictions play a large role in identifying the similar residues.
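The difference between the two scoring behaviours described above can be pictured with a minimal sketch. It is not the SALSA or SALSA-DT code: the function name `count_matches`, the tuple layout, the use of exact residue identity as the match criterion, and the five-position example are all assumptions made for illustration.

```python
from typing import List, Tuple

# One aligned position:
# (consensus residue, SG protein residue, is the SG residue POOL predicted?)
AlignedPosition = Tuple[str, str, bool]


def count_matches(positions: List[AlignedPosition], require_pool: bool) -> int:
    """Count aligned positions whose residue types agree.

    require_pool=False mimics the manual analysis: any matching residue counts.
    require_pool=True mimics the assumed automated behaviour: a matching
    residue counts only if it is also POOL predicted in the SG protein.
    """
    total = 0
    for consensus, query, pool_predicted in positions:
        if consensus != query:
            continue  # residue types differ, so this is never a match
        if require_pool and not pool_predicted:
            continue  # correct residue, but POOL did not predict it
        total += 1
    return total


# Hypothetical five-position signature, for illustration only.
example = [
    ("D", "D", True),
    ("K", "K", True),
    ("H", "H", False),
    ("E", "E", False),
    ("R", "S", True),
]

manual_score = count_matches(example, require_pool=False)
automated_score = count_matches(example, require_pool=True)
print(manual_score, automated_score)
```

In this toy case the manual-style count is higher than the automated-style count, which is the kind of gap that could move an SG protein into the wrong group when several of its active site residues are not POOL predicted.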

4.4 Summary

The SALSA methodology was successful in sorting the RPBB superfamily. The methodology was optimized in the initial phase to identify the best alignment and scoring and thus to develop the most reliable method possible. The RPBB superfamily contains eight functional subclasses that were sorted according to POOL predictions. The active sites of the subclasses were further found to share a conserved active site architecture across the superfamily. The eight β-strands contain predicted residues in specific spatial positions that are unique for each subclass. This was verified by the SALSA program. The four subclasses OMPDC, KGPDC, HPS and RPE clearly demonstrate this architecture. The three subclasses PRAI, IGPS and TrpA are very interesting because they catalyze sequential reactions in the biosynthesis of tryptophan and share a conserved architecture. The active site residues are different, but similar residues located along β-strands 1, 2, 3 and 5 act as the acid-base residues. The Amadori rearrangement is similar in two subclasses, HisA and PRAI: the residues are chemically similar and occupy similar spatial positions. The sorting of this superfamily is consistent with the notion of a common progenitor, which would account for the conservation of the active site architecture.

The optimized method, SALSA-DT, led to faster output and faster sorting of the superfamily. Though the method is not perfect, it will be further optimized in the next round of development. It introduces a new structural alignment idea and speeds up the SALSA method to give more information. There was an issue with some of the SG proteins, but it stemmed from the POOL predictions. The program did group the aligned POOL-predicted residues, which made it possible to predict the groups into which the SG proteins fall. The heuristics of SALSA-DT will need to be further optimized for SG proteins that have some residues that are not POOL predicted. Overall, this automated method provides new insight into what to focus on in the next optimized methodology.

References

1. Murzin, A. G.; Brenner, S. E.; Hubbard, T.; Chothia, C., SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 1995, 247 (4),

536-40.

2. Henn-Sax, M.; Hocker, B.; Wilmanns, M.; Sterner, R., Divergent evolution of (betaalpha)8- barrel enzymes. Biol Chem 2001, 382 (9), 1315-20.

3. Gandhimathi, A.; Nair, A. G.; Sowdhamini, R., PASS2 version 4: An update to the database of structure-based sequence alignments of structural domain superfamilies. Nucleic Acids Res

2012, 40 (Database issue), D531-4.

4. Brown, S.; Babbitt, P., Using the Structure-function Linkage Database to characterize functional domains in enzymes. Curr Protoc Bioinformatics 2006, Chapter 2, Unit 2 10.

5. Liebold, C.; List, F.; Kalbitzer, H. R.; Sterner, R.; Brunner, E., The interaction of ammonia and xenon with the imidazole glycerol phosphate synthase from Thermotoga maritima as detected by NMR spectroscopy. Protein Sci 2010, 19 (9), 1774-82.

6. Gerlt, J. A.; Babbitt, P. C., Divergent evolution of enzymatic function: mechanistically diverse superfamilies and functionally distinct suprafamilies. Annu Rev Biochem 2001, 70, 209-

46.

7. Wilmanns, M.; Priestle, J. P.; Niermann, T.; Jansonius, J. N., Three-dimensional structure of the bifunctional enzyme phosphoribosylanthranilate isomerase: indoleglycerolphosphate synthase from Escherichia coli refined at 2.0 A resolution. J Mol Biol 1992, 223 (2), 477-507.

8. Schneider, B.; Knochel, T.; Darimont, B.; Hennig, M.; Dietrich, S.; Babinger, K.; Kirschner,

K.; Sterner, R., Role of the N-terminal extension of the (betaalpha)8-barrel enzyme indole-3-

glycerol phosphate synthase for its fold, stability, and catalytic activity. Biochemistry 2005, 44

(50), 16405-12.

9. Weyand, M.; Schlichting, I., Crystal structure of wild-type tryptophan synthase complexed with the natural substrate indole-3-glycerol phosphate. Biochemistry 1999, 38 (50), 16469-80.

10. Wilmanns, M.; Hyde, C. C.; Davies, D. R.; Kirschner, K.; Jansonius, J. N., Structural conservation in parallel beta/alpha-barrel enzymes that catalyze three sequential reactions in the pathway of tryptophan biosynthesis. Biochemistry 1991, 30 (38), 9161-9.

11. Eberhard, M.; Tsai-Pflugfelder, M.; Bolewska, K.; Hommel, U.; Kirschner, K.,

Indoleglycerol phosphate synthase-phosphoribosyl anthranilate isomerase: comparison of the bifunctional enzyme from Escherichia coli with engineered monofunctional domains.

Biochemistry 1995, 34 (16), 5419-28.

12. Konagurthu, A. S.; Whisstock, J. C.; Stuckey, P. J.; Lesk, A. M., MUSTANG: a multiple structural alignment algorithm. Proteins 2006, 64 (3), 559-74.

13. Krieger, E.; Koraimann, G.; Vriend, G., Increasing the precision of comparative models with YASARA NOVA--a self-parameterizing force field. Proteins 2002, 47 (3), 393-402.

14. Stehlin, C.; Dahm, A.; Kirschner, K., Deletion mutagenesis as a test of evolutionary relatedness of indoleglycerol phosphate synthase with other TIM barrel enzymes. FEBS Lett

1997, 403 (3), 268-72.

15. Henn-Sax, M.; Thoma, R.; Schmidt, S.; Hennig, M.; Kirschner, K.; Sterner, R., Two

(betaalpha)(8)-barrel enzymes of histidine and tryptophan biosynthesis have similar reaction mechanisms and common strategies for protecting their labile substrates. Biochemistry 2002, 41

(40), 12032-42.

16. Hommel, U.; Eberhard, M.; Kirschner, K., Phosphoribosyl anthranilate isomerase catalyzes a reversible amadori reaction. Biochemistry 1995, 34 (16), 5429-39.

17. Darimont, B.; Stehlin, C.; Szadkowski, H.; Kirschner, K., Mutational analysis of the active site of indoleglycerol phosphate synthase from Escherichia coli. Protein Sci 1998, 7 (5), 1221-

32.

18. Hennig, M.; Darimont, B. D.; Jansonius, J. N.; Kirschner, K., The catalytic mechanism of indole-3-glycerol phosphate synthase: crystal structures of complexes of the enzyme from

Sulfolobus solfataricus with substrate analogue, substrate, and product. J Mol Biol 2002, 319 (3),

757-66.

19. Lang, D.; Thoma, R.; Henn-Sax, M.; Sterner, R.; Wilmanns, M., Structural evidence for evolution of the beta/alpha barrel scaffold by gene duplication and fusion. Science 2000, 289

(5484), 1546-50.

20. Alifano, P.; Fani, R.; Lio, P.; Lazcano, A.; Bazzicalupo, M.; Carlomagno, M. S.; Bruni, C.

B., Histidine biosynthetic pathway and genes: structure, regulation, and evolution. Microbiol Rev

1996, 60 (1), 44-69.

21. Chaudhuri, B. N.; Lange, S. C.; Myers, R. S.; Davisson, V. J.; Smith, J. L., Toward understanding the mechanism of the complex cyclization reaction catalyzed by imidazole glycerolphosphate synthase: crystal structures of a ternary complex and the free enzyme.

Biochemistry 2003, 42 (23), 7003-12.

22. Douangamath, A.; Walker, M.; Beismann-Driemeyer, S.; Vega-Fernandez, M. C.; Sterner,

R.; Wilmanns, M., Structural evidence for ammonia tunneling across the (beta alpha)(8) barrel of the imidazole glycerol phosphate synthase bienzyme complex. Structure 2002, 10 (2), 185-93.

23. Quevillon-Cheruel, S.; Leulliot, N.; Graille, M.; Blondeau, K.; Janin, J.; van Tilbeurgh, H.,

Crystal structure of the yeast His6 enzyme suggests a reaction mechanism. Protein Sci 2006, 15

(6), 1516-21.

24. Beismann-Driemeyer, S.; Sterner, R., Imidazole glycerol phosphate synthase from

Thermotoga maritima. Quaternary structure, steady-state kinetics, and reaction mechanism of the bienzyme complex. J Biol Chem 2001, 276 (23), 20387-96.

25. Caruthers, J.; Bosch, J.; Buckner, F.; Van Voorhis, W.; Myler, P.; Worthey, E.; Mehlin, C.;

Boni, E.; DeTitta, G.; Luft, J.; Lauricella, A.; Kalyuzhniy, O.; Anderson, L.; Zucker, F.; Soltis,

M.; Hol, W. G., Structure of a ribulose 5-phosphate 3-epimerase from Plasmodium falciparum.

Proteins 2006, 62 (2), 338-42.

26. Tsang, W. Y.; Wood, B. M.; Wong, F. M.; Wu, W.; Gerlt, J. A.; Amyes, T. L.; Richard, J.

P., Proton transfer from C-6 of uridine 5'-monophosphate catalyzed by orotidine 5'- monophosphate decarboxylase: formation and stability of a vinyl carbanion intermediate and the effect of a 5-fluoro substituent. J Am Chem Soc 2012, 134 (35), 14580-94.

27. Toth, K.; Amyes, T. L.; Wood, B. M.; Chan, K.; Gerlt, J. A.; Richard, J. P., Product deuterium isotope effects for orotidine 5'-monophosphate decarboxylase: effect of changing substrate and enzyme structure on the partitioning of the vinyl carbanion reaction intermediate. J

Am Chem Soc 2010, 132 (20), 7018-24.

28. Desai, B. J.; Wood, B. M.; Fedorov, A. A.; Fedorov, E. V.; Goryanova, B.; Amyes, T. L.;

Richard, J. P.; Almo, S. C.; Gerlt, J. A., Conformational changes in orotidine 5'-monophosphate decarboxylase: a structure-based explanation for how the 5'-phosphate group activates the enzyme. Biochemistry 2012, 51 (43), 8665-78.

29. Wu, N.; Gillon, W.; Pai, E. F., Mapping the active site-ligand interactions of orotidine 5'- monophosphate decarboxylase by crystallography. Biochemistry 2002, 41 (12), 4002-11.

30. Chan, K. K.; Wood, B. M.; Fedorov, A. A.; Fedorov, E. V.; Imker, H. J.; Amyes, T. L.;

Richard, J. P.; Almo, S. C.; Gerlt, J. A., Mechanism of the orotidine 5'-monophosphate decarboxylase-catalyzed reaction: evidence for substrate destabilization. Biochemistry 2009, 48

(24), 5518-31.

31. Wu, N.; Mo, Y.; Gao, J.; Pai, E. F., Electrostatic stress in catalysis: structure and mechanism of the enzyme orotidine monophosphate decarboxylase. Proc Natl Acad Sci U S A

2000, 97 (5), 2017-22.

32. Akana, J.; Fedorov, A. A.; Fedorov, E.; Novak, W. R.; Babbitt, P. C.; Almo, S. C.; Gerlt, J.

A., D-Ribulose 5-phosphate 3-epimerase: functional and structural relationships to members of the ribulose-phosphate binding (beta/alpha)8-barrel superfamily. Biochemistry 2006, 45 (8),

2493-503.

33. Li, G. L.; Liu, X.; Nan, J.; Brostromer, E.; Li, L. F.; Su, X. D., Open-closed conformational change revealed by the crystal structures of 3-keto-L-gulonate 6-phosphate decarboxylase from Streptococcus mutans. Biochem Biophys Res Commun 2009, 381 (3), 429-

33.

34. Yew, W. S.; Wise, E. L.; Rayment, I.; Gerlt, J. A., Evolution of enzymatic activities in the orotidine 5'-monophosphate decarboxylase suprafamily: mechanistic evidence for a proton relay system in the active site of 3-keto-L-gulonate 6-phosphate decarboxylase. Biochemistry 2004, 43

(21), 6427-37.

35. Wise, E. L.; Yew, W. S.; Gerlt, J. A.; Rayment, I., Evolution of enzymatic activities in the orotidine 5'-monophosphate decarboxylase suprafamily: crystallographic evidence for a proton

relay system in the active site of 3-keto-L-gulonate 6-phosphate decarboxylase. Biochemistry

2004, 43 (21), 6438-46.

36. Kato, N.; Yurimoto, H.; Thauer, R. K., The physiological role of the ribulose monophosphate pathway in bacteria and archaea. Biosci Biotechnol Biochem 2006, 70 (1), 10-21.

37. Orita, I.; Kita, A.; Yurimoto, H.; Kato, N.; Sakai, Y.; Miki, K., Crystal structure of 3- hexulose-6-phosphate synthase, a member of the orotidine 5'-monophosphate decarboxylase suprafamily. Proteins 2010, 78 (16), 3488-92.

38. Wise, E. L.; Yew, W. S.; Akana, J.; Gerlt, J. A.; Rayment, I., Evolution of enzymatic activities in the orotidine 5'-monophosphate decarboxylase suprafamily: structural basis for catalytic promiscuity in wild-type and designed mutants of 3-keto-L-gulonate 6-phosphate decarboxylase. Biochemistry 2005, 44 (6), 1816-23.

39. Sievers, F.; Wilm, A.; Dineen, D.; Gibson, T. J.; Karplus, K.; Li, W.; Lopez, R.;

McWilliam, H.; Remmert, M.; Soding, J.; Thompson, J. D.; Higgins, D. G., Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol Syst

Biol 2011, 7, 539.

40. Yasueda, H.; Kawahara, Y.; Sugimoto, S., Bacillus subtilis yckG and yckF encode two key enzymes of the ribulose monophosphate pathway used by methylotrophs, and yckH is required for their expression. J Bacteriol 1999, 181 (23), 7154-60.

41. Holliday, G. L.; Andreini, C.; Fischer, J. D.; Rahman, S. A.; Almonacid, D. E.; Williams,

S. T.; Pearson, W. R., MACiE: exploring the diversity of biochemical reactions. Nucleic Acids

Res 2012, 40 (Database issue), D783-9.

42. Sobolev, V.; Sorokine, A.; Prilusky, J.; Abola, E. E.; Edelman, M., Automated analysis of interatomic contacts in proteins. Bioinformatics 1999, 15 (4), 327-32.

43. Jelakovic, S.; Kopriva, S.; Suss, K. H.; Schulz, G. E., Structure and catalytic mechanism of the cytosolic D-ribulose-5-phosphate 3-epimerase from rice. J Mol Biol 2003, 326 (1), 127-35.

44. Kopp, J.; Kopriva, S.; Suss, K. H.; Schulz, G. E., Structure and mechanism of the amphibolic enzyme D-ribulose-5-phosphate 3-epimerase from potato chloroplasts. J Mol Biol

1999, 287 (4), 761-71.

45. Iiams, V.; Desai, B. J.; Fedorov, A. A.; Fedorov, E. V.; Almo, S. C.; Gerlt, J. A.,

Mechanism of the orotidine 5'-monophosphate decarboxylase-catalyzed reaction: importance of residues in the orotate binding site. Biochemistry 2011, 50 (39), 8497-507.

46. Goryanova, B.; Goldman, L. M.; Amyes, T. L.; Gerlt, J. A.; Richard, J. P., Role of a guanidinium cation-phosphodianion pair in stabilizing the vinyl carbanion intermediate of orotidine 5'-phosphate decarboxylase-catalyzed reactions. Biochemistry 2013, 52 (42), 7500-11.

47. Wise, E. L.; Yew, W. S.; Gerlt, J. A.; Rayment, I., Structural evidence for a 1,2-enediolate intermediate in the reaction catalyzed by 3-keto-L-gulonate 6-phosphate decarboxylase, a member of the orotidine 5'-monophosphate decarboxylase suprafamily. Biochemistry 2003, 42

(42), 12133-42.

48. Wise, E. L.; Rayment, I., Understanding the importance of protein structure to nature's routes for divergent evolution in TIM barrel enzymes. Acc Chem Res 2004, 37 (3), 149-58.

49. Yew, W. S.; Akana, J.; Wise, E. L.; Rayment, I.; Gerlt, J. A., Evolution of enzymatic activities in the orotidine 5'-monophosphate decarboxylase suprafamily: enhancing the promiscuous D-arabino-hex-3-ulose 6-phosphate synthase reaction catalyzed by 3-keto-L- gulonate 6-phosphate decarboxylase. Biochemistry 2005, 44 (6), 1807-15.

50. Gao, J., Catalysis by enzyme conformational change as illustrated by orotidine 5'- monophosphate decarboxylase. Curr Opin Struct Biol 2003, 13 (2), 184-92.

51. Mitsui, R.; Omori, M.; Kitazawa, H.; Tanaka, M., Formaldehyde-limited cultivation of a newly isolated methylotrophic bacterium, Methylobacterium sp. MF1: enzymatic analysis related to C1 metabolism. Journal of bioscience and bioengineering 2005, 99 (1), 18-22.

52. Wise, E.; Yew, W. S.; Babbitt, P. C.; Gerlt, J. A.; Rayment, I., Homologous (beta/alpha)8- barrel enzymes that catalyze unrelated reactions: orotidine 5'-monophosphate decarboxylase and

3-keto-L-gulonate 6-phosphate decarboxylase. Biochemistry 2002, 41 (12), 3861-9.

Chapter 5

Concluding remarks and future directions

5.1 Concluding remarks

Computational tools apply theoretical methods to mimic or model the behavior of molecules. The research described in this dissertation deals with the application of various computational tools used in drug design and protein function annotation. In Part I, computational tools helped in the design of xanthine-based antagonists for the human A2AAR, which is overexpressed in hypoxic tumors; such tumors are difficult to treat with conventional radiation or chemotherapy. In Part II, computational tools helped in the development of the SALSA methodology to reliably assign function to groups of structural genomics proteins and gave insight into sorting superfamilies of proteins by their active sites.

Chapter 1 details the background of the computational tools used in drug discovery: visualization, molecular mechanics, molecular dynamics simulation, comparative modeling, functional site prediction and molecular docking. The drug discovery process is sped up by the use of new methods to screen which molecules are better candidates for optimization. In Chapter 2, the research highlights the search for a promising molecule for the A2AAR target in hypoxic tumors. Previous research showed that an antagonist of the A2AAR would be a promising immunotherapy. Building upon the basic natural antagonist, caffeine, and incorporating Lipinski’s rule of five, a class of xanthine-based molecules was investigated. The goal of this project was to find a molecule with favorable properties, such as increased water solubility and decreased penetration of the blood-brain barrier (BBB). The availability of two crystal structures, one co-crystallized with an agonist and one with an antagonist, gave more information on the types of interactions (hydrogen bonding, van der Waals) that are needed for good binding to the target receptor. Molecular docking studies helped design and prioritize the synthesis of PEGylated analogues to establish a lead compound. The utilization of PEG aided the stability and solubility of the molecule. The functional assay provided promising data for moving this molecule forward, and its verification in an in vitro system is the step toward in vivo studies. The second goal of the project was to use this molecule as an imaging agent to better identify hypoxic tumors. The PEGylated region of the lead molecule provides the option of labeling the compound with either an 18F or 123I tag to conduct biodistribution studies in rodent models. The biodistribution studies will give more information about whether the molecule is able to identify hypoxic tumor regions of different sizes and whether it avoids crossing the BBB. The A2AAR docking model is validated and can be used by others to perform more docking studies for agonists and antagonists.
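Lipinski's rule of five, mentioned above as a design constraint, is commonly stated as four thresholds: molecular weight of at most 500, logP of at most 5, at most 5 hydrogen-bond donors and at most 10 hydrogen-bond acceptors. The sketch below counts violations for precomputed descriptor values. The class and function names are hypothetical, the caffeine values are approximate literature values rather than data from this work, and the dissertation does not state how the rule was applied in practice.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Precomputed molecular descriptors, supplied by the caller."""
    molecular_weight: float  # g/mol
    log_p: float             # octanol-water partition coefficient
    h_bond_donors: int       # N-H and O-H groups
    h_bond_acceptors: int    # N and O atoms


def rule_of_five_violations(d: Descriptors) -> int:
    """Return how many of Lipinski's four criteria are violated."""
    checks = [
        d.molecular_weight > 500,
        d.log_p > 5,
        d.h_bond_donors > 5,
        d.h_bond_acceptors > 10,
    ]
    return sum(checks)


# Approximate literature values for caffeine; not taken from this work.
caffeine = Descriptors(molecular_weight=194.2, log_p=-0.1,
                       h_bond_donors=0, h_bond_acceptors=6)
print(rule_of_five_violations(caffeine))
```

A screen of this kind would typically be run over candidate analogues before docking, flagging compounds that break more than one criterion for closer inspection.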

Chapter 3 describes the computational tools utilized in the development of the methods SALSA and SALSA-DT. Functional site predictors are used to identify active site residues to create unique signatures for each subclass. By incorporating multiple structure alignment tools, a local spatial pattern at the active site is defined for each functional type, and these patterns can be used to sort a superfamily according to biochemical function. Further, identified SG members of this superfamily can be screened for correct or incorrect annotation. SALSA-DT utilizes Delaunay triangulation to match the local active sites of 3D protein structures for many-fold faster structural alignments within the superfamily. In Chapter 4, SALSA and SALSA-DT were applied to the ribulose-phosphate binding barrel (RPBB) superfamily. SALSA and SALSA-DT were shown to sort the RPBB superfamily into eight subclasses according to consensus signatures that describe the function based on local structure at the active site. With the established consensus signatures, a total of 27 structural genomics proteins with structural similarity to the RPBB superfamily were investigated. Of those proteins, 24 were reliably assigned a correct annotation within the RPBB superfamily. The remaining three structural genomics proteins could be assigned only tentatively, because SALSA found low MCS values for them owing to active site residues that were not predicted by POOL. Additionally, SALSA and SALSA-DT revealed more information about the structural architecture of the active sites within the ribulose phosphate epimerase (RPE), hexulose phosphate synthase (HPS), orotidine monophosphate decarboxylase (OMPDC) and keto-gulonate-phosphate decarboxylase (KGPDC) subclasses. The active site architecture for these subclasses is very similar among the predicted residues that fall within the eight beta-strands. However, the chemical type of these active site residues differs between subclasses.
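One way to picture how tetrahedra from a triangulation of active site residues might be compared is to reduce each tetrahedron to its six sorted edge lengths and test whether two such signatures agree within a tolerance. The sketch below assumes the tetrahedra have already been produced (the Delaunay step itself is not computed here); the function names, the 1.0 Å tolerance and the coordinates are hypothetical, and this is not presented as the matching criterion actually used in SALSA-DT.

```python
from itertools import combinations
from math import dist
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def edge_signature(tetrahedron: Sequence[Point]) -> List[float]:
    """Sorted lengths of the six edges of a tetrahedron (four 3D points)."""
    if len(tetrahedron) != 4:
        raise ValueError("a tetrahedron has exactly four vertices")
    return sorted(dist(p, q) for p, q in combinations(tetrahedron, 2))


def similar(tet_a: Sequence[Point], tet_b: Sequence[Point],
            tolerance: float = 1.0) -> bool:
    """True if corresponding sorted edge lengths differ by at most tolerance."""
    sig_a, sig_b = edge_signature(tet_a), edge_signature(tet_b)
    return all(abs(a - b) <= tolerance for a, b in zip(sig_a, sig_b))


# Hypothetical residue positions (in angstroms), for illustration only.
reference = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (0.0, 4.0, 0.0), (0.0, 0.0, 4.0)]
shifted = [(x + 10.0, y - 2.0, z + 3.0) for x, y, z in reference]
enlarged = [(2 * x, 2 * y, 2 * z) for x, y, z in reference]

print(similar(reference, shifted), similar(reference, enlarged))
```

Because edge lengths are unchanged by translation and rotation, no prior superposition is needed for this comparison. The signature is deliberately coarse: it ignores residue type and does not distinguish mirror images, so a real matcher would need additional checks.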

5.2 Future directions

With its development and automation, SALSA-DT can now be applied to more superfamilies and structural genomics proteins. The initial SALSA methodology fell short in that the time required to perform the multiple structure alignments between subclasses was too long for wide application across many superfamilies. The collaboration with the CCIS and Mathematics groups resulted in the development of a reliable and quicker matching algorithm to sort proteins into subclasses and to match proteins of unknown function against a large set of proteins of known function. The next phase for the automation is to extend the matching between all subclasses for an overall superfamily alignment. The integration of Pachner graphs,1 as an alternative triangulation method, will extend the matching of tetrahedra in a different way, which can aid in overall superfamily alignment. Another option for better matching is to apply more weight to the highly ranked POOL-predicted residues. Currently, the set of top 50 residues is used as the cut-off for each protein structure. As observed with the matching within subclasses, prioritizing the top residues in the tetrahedra may fix the issue of lower-ranked residues being matched incorrectly.
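The proposed weighting of highly ranked POOL residues could take many forms. As one hedged illustration, the sketch below assigns each residue a weight that falls linearly with its POOL rank up to the top-50 cut-off mentioned above, and scores a tetrahedron by the mean weight of its four residues. The linear form, the function names and the example ranks are assumptions, not part of the published method.

```python
from statistics import mean
from typing import Sequence

CUTOFF = 50  # number of top-ranked POOL residues kept per structure (from the text)


def rank_weight(rank: int, cutoff: int = CUTOFF) -> float:
    """Linear weight: rank 1 gives 1.0, rank == cutoff gives 1/cutoff, beyond gives 0."""
    if rank < 1:
        raise ValueError("POOL ranks start at 1")
    if rank > cutoff:
        return 0.0
    return (cutoff - rank + 1) / cutoff


def tetrahedron_weight(ranks: Sequence[int]) -> float:
    """Mean weight of the four residues that form one tetrahedron."""
    if len(ranks) != 4:
        raise ValueError("a tetrahedron has exactly four vertices")
    return mean(rank_weight(r) for r in ranks)


high = tetrahedron_weight([1, 2, 3, 4])     # built from top-ranked residues
low = tetrahedron_weight([40, 45, 48, 50])  # built from low-ranked residues
print(high > low)
```

Sorting candidate tetrahedra by such a weight before matching would let pairs built from top-ranked residues be considered first, which is one way to act on the prioritization described above.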

Applying SALSA-DT to more well-studied superfamilies will serve as further validation for the method. In the long term, the creation of a superfamily database that sorts many different superfamilies would be an exciting addition to other databases such as the Structure Function Linkage Database (SFLD),2 Structural Classification Of Proteins (SCOP)3 and the PASS2 database.4 The advantage is that SALSA-DT will provide sorting based on the local active site rather than the global similarity applied by these three databases. Additionally, the creation of a database for SG proteins with POOL predictions may provide faster identification of function for others interested in the functionally important residues.

These two projects have demonstrated the powerful applications computational tools make possible and show the broad application to research in drug discovery and protein function annotation.

References

1. Burton, B. A., The Pachner graph and the simplification of 3-sphere triangulations. In Proceedings of the twenty-seventh annual symposium on Computational geometry, ACM: Paris, France, 2011; pp 153-162.

2. Akiva, E.; Brown, S.; Almonacid, D. E.; Barber, A. E., 2nd; Custer, A. F.; Hicks, M. A.; Huang, C. C.; Lauck, F.; Mashiyama, S. T.; Meng, E. C.; Mischel, D.; Morris, J. H.; Ojha, S.; Schnoes, A. M.; Stryke, D.; Yunes, J. M.; Ferrin, T. E.; Holliday, G. L.; Babbitt, P. C., The structure-function linkage database. Nucleic Acids Res 2014, 42 (1), D521-30.

3. Murzin, A. G.; Brenner, S. E.; Hubbard, T.; Chothia, C., SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 1995, 247 (4), 536-40.

4. Gandhimathi, A.; Nair, A. G.; Sowdhamini, R., PASS2 version 4: An update to the database of structure-based sequence alignments of structural domain superfamilies. Nucleic Acids Res 2012, 40 (Database issue), D531-4.

CURRICULUM VITAE

JOSLYNN S. LEE
Northeastern University
Department of Chemistry and Chemical Biology
360 Huntington Ave
Boston, MA 02115
E-mail: [email protected]
Personal: (617) 335-8022
Office: (617) 373-4492
www.northeastern.edu/org/

EDUCATION
Northeastern University, Boston, MA
Ph.D., Department of Chemistry and Chemical Biology, 09/2008-02/2014
Advisors: Dr. Mary Jo Ondrechen and Co-Advisor: Dr. Graham Jones
Title: Applications of molecular modeling techniques in the design of xanthine based adenosine receptor antagonists and the development of the protein function annotation method SALSA
Fort Lewis College, Durango, CO
B.S., Major in Chemistry and Cellular & Molecular Biology, 05/2006
Advisor: Dr. Leslie Sommerville

TEACHING EXPERIENCE
Northeastern University
05-08/2011, 05-08/2012, 05-08/2013: Undergraduate Research Mentor, Ondrechen Research Group. Worked closely with five summer undergraduate research students who presented their work at national conferences.
05/2012, 05/2011: Graduate Student Instructor for Advanced Lab Techniques. Created a two-day advanced molecular modeling lecture/lab course for 1st-year graduate students.
08/2009-12/2009: Teaching Assistant, Undergraduate General Chemistry I: Lab. Instructor of three lab sections for science (non-chemistry) majors.
08/2008-05/2009: Teaching Assistant, Undergraduate Honors Chemistry: Recitation and Lab. One section accelerated-chemistry lab and three sections of weekly recitations.
Fort Lewis College
09/2004-04/2006: Undergraduate Student Tutor, General Chemistry I/II & Biochemistry I. Weekly sessions to overview concepts, address homework problems in small groups and one-on-one sessions.

RESEARCH EXPERIENCE
Northeastern University, Department of Chemistry and Chemical Biology
Graduate Research Assistant, Dr. Mary Jo Ondrechen, 09/2009-02/2014
NSF Graduate Research Fellow, 2010-2013
Project 1: Designed a novel automated computational method to classify the biochemical functional roles of proteins within a superfamily. Used the method to predict the function of proteins with known structure and unknown function or with an assigned hypothetical function.
Project 2: Screened small molecules in silico to target the human A2A adenosine receptor in order to identify hypoxic tumors for purposes of detecting early stage cancer.

Fort Lewis College, Department of Chemistry
Minority Access for Research Careers (MARC) Fellow, Dr. Leslie Sommerville, 05/2004-05/2006
Investigated the metabolic processes of Acidobacterium capsulatum in hopes of understanding bioremediation. Used bioinformatics data, practiced bacterial culturing techniques, purified proteins and performed selected protein assays using a UV spectrometer and HPLC for analysis.

Dartmouth Medical School, Department of Biochemistry
Summer Undergraduate Research Fellow (SURF), Dr. Henry Higgs, 06-08/2005
Investigated lymphocyte cell surface protrusions using scanning electron microscopy (SEM) and fluorescence microscopy to determine their role in cell motility and metastasis.

PROFESSIONAL EXPERIENCE
Research Associate, Vertex Pharmaceuticals Inc., Cambridge, MA, 10/2006-08/2008
In the Drug Innovative Pharmacokinetics (DIPK) Discovery Bioanalytical Chemistry group, developed methods to detect and quantitate small molecules by LC-MS from in vivo biological samples. Implemented a pre-screening automated sample analysis process to increase the efficiency by 25% on weekly studies. Performed routine maintenance and trouble-shooting of HPLC and LC-MS systems. Developed and maintained the department website to centralize essential laboratory and business information.

Research Assistant, Lab Pros Inc., Waltham, MA, 07-09/2006
Contract position at Vertex Pharmaceuticals Inc. for the Physical Analytical Chemistry Department (PAC) group that consisted of maintenance and trouble-shooting of LC-MS systems. Conducted high-throughput purification studies for small molecules using LC-MS, fraction collection and purification methods.

PUBLICATIONS

Lee, J.S. and MJ Ondrechen (2014) Using the Structurally Aligned Local Sites of Activity (SALSAs) automated computational method using Delaunay Triangulation identify protein function annotation of structural genomics proteins within the Ribulose-phosphate binding barrel (RPBB) superfamily. (In preparation)

Thomas, R., Lee, JS, Chevalier, V., Selesniemi, K., Hatfield, S., Ondrechen, MJ, Sitkovsky, M, Jones, GB. (2013) Design and evaluation of xanthine based adenosine receptor antagonists: Potential hypoxia targeted immunotherapies. Bioorganic & Medicinal Chemistry. 21, 23, 7453-7464.

Wang, Z., Yin, P., Lee, J.S., Parasuram, R., Somarowthu, S., Ondrechen, MJ. (2013) Protein Function Annotation with Structurally Aligned Local Sites of Activity (SALSAs). BMC Bioinformatics. 14(Suppl 3):S13.

Lee, J.S. and MJ Ondrechen. (2011) Electrostatic Properties for Protein Functional Site Prediction. In: Kihara, D. (1st Ed.) Protein Function Prediction for Omics Era. (pp. 183-196) USA: Springer.

Parasuram, R., Lee, J. S., Yin, P., Somarowthu, S., Ondrechen, MJ. (2010) Functional Classification of Protein 3D Structures From Predicted Local Interaction Sites. Journal of Bioinformatics and Computational Biology. 8, SI1, 1-15.

ABSTRACTS AND PRESENTATIONS

Lee, J. Make the Best of Graduate School to Land that Next Opportunity: A Postdoctoral Position. AISES National Conference, November 2013. Denver, CO. (Oral Presentation)

Lee, J., Garg, R., Parasuram, R., Tian, L., Cooperman, G., Suciu, A., Ondrechen, MJ. Using the Structurally Aligned Local Sites of Activity (SALSA) computational method to determine biochemical function of structural genomics proteins. Protein Society Meeting, July 2013, Boston, MA (Poster Presentation)

Lee, J., Thomas, R., Chevalier, V., Selesniemi, K., Hatfield, S., Ondrechen, MJ., Sitkovsky, M., Jones, GB. Molecular modeling and small molecular design of xanthine based adenosine receptor antagonists. RICT International Conference on Medicinal Chemistry: Drug Discovery and Selection, July 2013, Nice, France. (Poster Presentation)

Lee, J., Ondrechen, MJ. Structurally aligned local sites of activity (SALSAs) computational method for the prediction of function of structural genomics proteins. 245th ACS National Meeting. April 2013, New Orleans, Louisiana. (Poster Presentation – selected for Sci-Mix)

Lee, J., Ondrechen, MJ. Using the Structurally Aligned Local Sites of Activity (SALSAs) computational method to predict function for structural genomics proteins. Pacific Symposium on Biocomputing (PSB), January 2013, Big Island, Hawaii. (Poster Presentation)

Lee, J., Ondrechen, MJ. Using the Structurally Aligned Local Sites of Activity (SALSAs) computational method to predict function for structural genomics proteins. 3rd Annual Computational Biology and Innovation PhD Symposium, December 2012, Dublin, Ireland. (Oral Presentation)

Lee, J. Preparing students for success in graduate school. AISES National Conference, November 2012. Anchorage, AK. (Oral Presentation- Highly rated session, positive feedback)

Lee, J., Ondrechen, MJ. Functional assignment of structural genomics proteins in the ribulose-phosphate binding barrel superfamily. 244th ACS National Meeting, August 2012, Philadelphia, PA. (Poster Presentation)

Lee, J., Chevalier, V., Hatfield, S., Ondrechen, MJ, Jones, G. Molecular Modeling and Rational Small Molecule Design for the Human A2A Adenosine Receptor. Protein Society Meeting, August 2012, San Diego, CA (Poster Presentation)

Lee, J. Ondrechen, MJ. Structurally Aligned Local Sites of Activity (SALSA): A computational method for functional annotation of structural genomics proteins. Trends in Enzymology (TinE) Conference, June 2012. Gottingen, Germany. (Poster Presentation)

Lee, J., Ondrechen, MJ, Jones, G. Molecular Modeling and Rational Small Molecule Design for the Human A2A Adenosine Receptor. St. Jude’s Hospital National Graduate Student Symposium (NGSS), March 2012, Memphis, TN. (Oral & Poster Presentation)

Lee, J. Ondrechen, MJ. Functional and Structural Relationships of Proteins in the Ribulose-Phosphate Binding Barrel Superfamily. AISES National Conference, November 2011. Minneapolis, MN. (Third place in Graduate Oral Presentations)

Lee, J., Ondrechen, MJ, Jones, G. Molecular modeling and small molecule design for the human A2A adenosine receptor. Protein Society Meeting, July 2011, Boston, MA. (Poster Presentation)

Lee, J., Somarowthu, S., Ondrechen, MJ. Identification of functional subclasses in the ribulose-phosphate binding barrel superfamily using computational tools. ISMB/ECCB Meeting, July 2011, Vienna, Austria. (Poster Presentation)

Lee, J. Discovering the Scientist Within: My perspective of interdisciplinary research. MARC U*STAR Symposium, Fort Lewis College, January 2011, Durango, CO. (Invited Oral Presentation)

Lee, J., Ondrechen, MJ, Jones, G. Computationally guided ligand design for the human adenosine A2A receptor for nanoparticle imaging of hypoxic tumors. AISES National Conference, November 2010, Albuquerque, NM. (First place in Graduate Oral Presentations)

Lee, J., Somarowthu, S., Ondrechen, MJ. Identification of functional subclasses in the ribulose-phosphate binding barrel superfamily using computational tools. 240th ACS National Meeting, August 2010, Boston, MA. (Poster Presentation)

Lee, J., Somarowthu, S., Ondrechen, MJ. Identification of functional subclasses in the ribulose-phosphate binding barrel superfamily using computational tools. ISMB Meeting, July 2010, Boston, MA. (Poster Presentation)

Lee, J., Somarowthu, S., Ondrechen, MJ. Computationally guided ligand design for the human adenosine A2A receptor for nanoparticle imaging of hypoxic tumors. U of Texas-Austin IGERT Workshop, April 2010, Austin, TX. (Poster Presentation)

Sommerville, L. E., Fujii, C., Lee, J., and DelaRosa, C. Glucose Metabolism and Enzymology in Acidobacterium capsulatum. 229th ACS National Meeting, April 2006, San Diego, CA. (Poster Presentation)

Lee, J., Nicholson-Dykstra, S., Higgs, H. Investigating Surface Structures of Lymphocytes. 45th ASCB Conference, December 2005, San Francisco, CA. (Poster Presentation)

AWARDS AND HONORS

Presidential Volunteer Service Award from the White House – Bronze (100 to 249 hours), 2013
Protein Society Finn Wold Travel Award to San Diego, CA, 2012
St. Jude’s Hospital National Graduate Student Symposium (NGSS) Speaker, 2012 (competitive; top 3% of invited applicants selected to attend)
Travel Award to AISES National Conference in Minneapolis, MN, 2011
Travel Award to ISMB/ECCB International Conference in Vienna, Austria, 2011
National Science Foundation Graduate Research Fellowship, 2010-2013 (competitive; 3-year award)
NSF-IGERT Nanomedicine Traineeship at Northeastern University, 2010
Sequoyah Fellow, American Indian Science and Engineering Society (AISES), 2010
Laguna Education Foundation Scholar, 2008-2010
Catching the Dream Scholarship, 2008-2009
Vertex Team VOCAP, Vertex Pharmaceuticals Inc., 2008 (2nd highest company award)
Fort Lewis Chemistry Department Senior Award, 2006
Native American Honor Society (NAHS), President, 2005-2006
Fort Lewis College Dean's List, 2005
Minority Access for Research Careers (MARC) U*STAR Fellow, 2004-2006
Catching the Dream Scholarship, 2002-2006 (lifetime award)
Laguna Education Foundation Scholar, 2002-2006 (lifetime award)

MEMBERSHIPS AND PROFESSIONAL SOCIETIES

Society for Advancement of Chicano and Native Americans in Science (SACNAS), 2013-present
American Indian Science and Engineering Society (AISES), Sequoyah Fellow (Lifetime), 2010-present
International Society for Computational Biology (ISCB), 2009-present
Protein Society, 2009-present
Sigma Xi, 2006-present
American Chemical Society (ACS), 2004-present

TECHNICAL SKILLS (CURRENT ARE BOLD)

Bioanalytical: Sample preparation (solid-phase extraction and liquid-liquid extraction), method development and maintenance of HPLC and LC-MS systems
Physical: Scanning Electron Microscopy (SEM), Fluorescence Microscopy, Infrared Spectroscopy
Biochemical: Protein Purification, Bacteria Culture, PCR, Gel Electrophoresis (SDS-PAGE and Western Blot), Column Chromatography, ELISA and Flow Cytometry
Computer: Watson LIMS, Analyst Software (LC-MS), Agilent Chemstation Software (HPLC), YASARA, AutoDock, Schrodinger's GLIDE, MAESTRO, CHIMERA, PyMOL, Multiconformation Continuum Electrostatics (MCCE), Gaussian 3.0, Dreamweaver Webpage Design, Wordpress, VirtualBox (VM), Python, AMBER and ambertools

PROFESSIONAL ACTIVITIES

Advisory Council, Youth Enrichment Services (Y.E.S.) Kids, Boston, MA. 12/2011-08/2013 Serve on the Program Committee, which addresses program sustainability, the development of new programs and the expansion of existing programs. Organized and analyzed the reported data on youth and volunteer attendance in specific programs to guide a pilot coaching-mentoring initiative. Y.E.S. introduces Boston kids and teens to outdoor sports and teaches them positive values, healthy lifestyles, teamwork and self-confidence.

Non-Residential Runner, Back on my Feet (BOMF), Boston, MA. 05/2010-10/2013 Volunteer as a non-residential runner to help promote the self-sufficiency of the homeless population by engaging them in running as a means to build confidence, strength and self-esteem. Meet three times a week (Monday, Wednesday and Friday) at 5:45 AM for morning runs. Have raised over $4000 for the non-profit and run over 2500 miles with the team so far.

Mentor Scientist, Science Club for Girls (SCFG), Cambridge, MA. 09/2009-08/2013 Hands-on involvement as a mentor/teacher for the after-school (10-week) science club program for K-7 grade girls belonging to groups that are underrepresented in the sciences. Led a club of kindergarten students in understanding matter (solid, liquid, gas) and a club of sixth-grade girls in engineering and building structures, and trained other mentor scientists on leadership, classroom awareness and diversity. Young girls have an opportunity to get involved in science and engineering activities in a fun, nurturing, interactive environment, an opportunity they might not otherwise receive.

NU Chemistry Graduate Student Association (GSA), Boston, MA. 05/2008-12/2011, 01/2013-12/2013 Regularly update and maintain the GSA website for the chemistry and chemical biology department. Helped on the Colloquium Committee to organize the selection of speakers for the colloquium series and the alumni series. Updated the colloquium website to provide a common place to find announcements and relevant information about the visiting speakers.

Gallery Guide, Harvard Museum of Natural History (HMNH), Cambridge, MA. 11/2007-12/2011 Engage the public, children and adults alike, by interpreting the various exhibits through hands-on activities, nature-story readings, and other public programs such as lectures, festivals and family events to enhance understanding of the natural world.