THE APPLICATION OF SEMIEMPIRICAL METHODS IN DRUG DESIGN

By

MARTIN B. PETERS

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2007

© 2007 Martin B. Peters

For Jane

ACKNOWLEDGMENTS

Words cannot describe my Jane. She is everything I could ask for. She has stood by me even when I left Ireland to pursue my dream of getting my PhD. Thank you, honey, for your love, support, and the sacrifices you have made for us.

I thank my mother for always giving me tremendous support and for her words of wisdom and encouragement. I would also like to thank my two brothers, Patrick and Francis, and my two sisters, Marian and Deirdre, for all their encouragement and support.

Kennie, thank you for giving me the opportunity to work with you; I have truly enjoyed the experience. I would like to express my gratitude to all Merz group members, especially Kaushik, Andrew, Ken, Kevin, and Duane, for their support and friendship. I would also like to acknowledge the effort of Mike Weaver, who helped by editing this dissertation.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
ABSTRACT

CHAPTER

1 INTRODUCTION

2 THEORY AND METHODS
  2.1 Receptor-Ligand Binding Free Energy
  2.2 Computational Drug Design
  2.3 Molecular Mechanics
  2.4 Quantum Mechanics
  2.5 Ligand Based Drug Design
    2.5.1 3D-QSAR with QM descriptors
    2.5.2 Field-based Methods
    2.5.3 Spectroscopic 3-D QSAR
    2.5.4 Quantum QSAR and Molecular Quantum Similarity
  2.6 Receptor Based Drug Design
  2.7 Semiempirical Divide-And-Conquer Approach
  2.8 Pairwise Energy Decomposition (PWD)
  2.9 Quantum Mechanical Charge Models
  2.10 Comparative Binding Energy Analysis (COMBINE)
  2.11 SemiEmpirical Comparative Binding Energy Analysis (SE-COMBINE)
  2.12 Graph Theory
  2.13 Statistical Methods
  2.14 Metalloproteins

3 MODELING TOOL KIT++
  3.1 Introduction
  3.2 Overview
    3.2.1 Development
    3.2.2 Library Hierarchy
    3.2.3 Molecule Library
    3.2.4 Graph Library
    3.2.5 MM Library
    3.2.6 GA Library
    3.2.7 Statistics Library
    3.2.8 Molecular Fragment Library
    3.2.9 Parsers Library
  3.3 Hybridization, Bond Order and Formal Charge Perception
  3.4 Ring Perception
  3.5 Addition of Hydrogen Atoms to Molecules
  3.6 Conformational Sampling
  3.7 Substructure Searching/Functionalize
  3.8 Clique Detection/Maximum Common Pharmacophore
  3.9 Superimposition
  3.10 Conclusions

4 SEMIFLEXIBLE QUANTUM MECHANICAL ALIGNMENT OF DRUG-LIKE MOLECULES
  4.1 Introduction
  4.2 Implementation
    4.2.1 Ligand Conformational Searching
    4.2.2 Structural Alignment and Clique Detection
    4.2.3 Semiempirical Similarity Score
  4.3 Results and Discussion
    4.3.1 Data Set
    4.3.2 Carboxypeptidase A
    4.3.3 Glycogen Phosphorylase
    4.3.4 Immunoglobin
    4.3.5 Streptavidin
    4.3.6 Dihydrofolate Reductase
    4.3.7 Trypsin
    4.3.8 Estrogen Receptor
    4.3.9 Peroxisome Proliferator-Activated Receptor γ
    4.3.10 Human Carbonic Anhydrase II
    4.3.11 Thrombin
    4.3.12 Elastase
    4.3.13 Thermolysin
  4.4 Conclusions

5 METAL CLUSTER MOLECULAR MECHANICS PARAMETERIZATION
  5.1 Introduction
  5.2 Implementation
    5.2.1 Equilibrium Bond Lengths and Angles
    5.2.2 Force Constants
    5.2.3 Point Charges
  5.3 Zinc AMBER Force Field
    5.3.1 Protein Data Bank Survey of Zinc Containing Proteins
    5.3.2 Tetrahedral Zn Environment Force Field Parameterization
  5.4 Conclusions

6 CONCLUSIONS

APPENDIX

A ALGORITHMS
  A.1 Subgraph Isomorphism Algorithm
  A.2 Maximum Common Pharmacophore

B AMBER GRADIENTS
  B.1 Vector Math and Derivatives
  B.2 AMBER First Derivatives
    B.2.1 Bond
    B.2.2 Angle
    B.2.3 Dihedral
    B.2.4 Electrostatic
    B.2.5 van der Waals

C FRAGMENT LIBRARY
  C.1 Terminal Fragments
  C.2 Two Point Linker Fragments
  C.3 Three Point Linker Fragments
  C.4 Four Point Linker Fragments
  C.5 Five Point Linker Fragments
  C.6 Three Membered Ring Fragments
  C.7 Four Membered Ring Fragments
  C.8 Five Membered Ring Fragments
  C.9 Six Membered Ring Fragments
  C.10 Greater than Six Membered Ring Fragments
  C.11 Fused Ring Fragments

REFERENCES

BIOGRAPHICAL SKETCH

LIST OF TABLES

2-1 Correspondence between Graph Theory and Chemical Terminology
3-1 Disulfide Bond Prediction Parameters
3-2 Meng Atomic Covalent Radii
3-3 Labute Algorithm Upper Bound Bond Conditions
3-4 Labute Algorithm Atom Hybridization Assignment
3-5 Labute Algorithm Lower Bound Single Bond Lengths
3-6 Labute Algorithm Bond Weights
3-7 Hydrogen Bond Lengths
3-8 Hydrogen Bond Angles
3-9 Hydrogen Bond Dihedrals
3-10 Dihedral Angles Available based on Bond Type
4-1 Compound Alignment Literature
4-2 Protein-Ligand Data Set
4-3 Statistics of CuTieP Performance
4-4 Carboxypeptidase A Ligand Alignments
4-5 Glycogen Phosphorylase Ligand Alignments
4-6 Immunoglobin Ligand Alignments
4-7 Streptavidin Ligand Alignments
4-8 Dihydrofolate Reductase Ligand Alignments
4-9 Trypsin Ligand Alignments
4-10 Estrogen Receptor Ligand Alignments
4-11 PPARγ Ligand Alignments
4-12 40 Human Carbonic Anhydrase II Inhibitors
4-13 Human Carbonic Anhydrase II Results
4-14 Thrombin Ligand Alignments
4-15 Elastase Ligand Alignments
4-16 Thermolysin Ligand Alignments
5-1 Metal Ions in the Protein Data Bank
5-2 Published Metalloprotein Force Fields Using the Bonded Plus Electrostatics Model
5-3 Metal-Donor Bond Target Lengths
5-4 Ideal Angles Used to Calculate Root Mean Square Deviations for Tetrahedral, Square Planar, Trigonal Bipyramidal, Square Pyramid and Octahedral Geometries
5-5 Tetrahedral Zinc Primary Ligating Residues
5-6 Zn-CCCC Cluster Bond Lengths and Force Constants
5-7 Zn-CCCC Cluster Angles and Force Constants
5-8 Zn-CCCH Cluster Bond Lengths and Force Constants
5-9 Zn-CCCH Cluster Angles and Force Constants
5-10 Zn-CCCH Cluster Angles and Force Constants
5-11 Zn-CCHH Cluster Bond Lengths and Force Constants
5-12 Zn-CCHH Cluster Angles and Force Constants
5-13 Zn-CHHH Cluster Bond Lengths and Force Constants
5-14 Zn-CHHH Cluster Angles and Force Constants
5-15 Zn-HHHH Cluster Bond Lengths and Force Constants
5-16 Zn-HHHH Cluster Angles and Force Constants
5-17 Cysteine Charges using ChgModA for the Zn-CCCC, -CCCH, -CCHH, and -CHHH Clusters
5-18 Cysteine Charges using ChgModB for the Zn-CCCC, -CCCH, -CCHH, and -CHHH Clusters
5-19 Histidine Charges using ChgModA for the Zn-CCCC, -CCCH, -CCHH, and -CHHH Clusters
5-20 Histidine Charges using ChgModB for the Zn-CCCC, -CCCH, -CCHH, and -CHHH Clusters
5-21 Zn-HHHO Cluster Bond Lengths and Force Constants
5-22 Zn-HHHO Cluster Angles and Force Constants
5-23 Zn-HHOO Cluster Bond Lengths and Force Constants
5-24 Zn-HHOO Cluster Angles and Force Constants
5-25 Zn-HOOO Cluster Bond Lengths and Force Constants
5-26 Zn-HOOO Cluster Angles and Force Constants
5-27 Histidine and Water's Partial Charges using ChgModB for the Zn-HHHO, -HHOO, and -HOOO Clusters
5-28 Zn-HHHD and Zn-HHDD Cluster Bond Lengths and Force Constants
5-29 Zn-HHHD Cluster Angles and Force Constants
5-30 Zn-HHDD Cluster Angles and Force Constants
5-31 Histidine and Aspartate Residue Charges using ChgModB for the Zn-HHHD and -HHDD Clusters
C-1 Terminal Fragments
C-2 Two Point Linker Fragments
C-3 Three Point Linker Fragments
C-4 Four Point Linker Fragments
C-5 Five Point Linker Fragments
C-6 Three Membered Ring Fragments
C-7 Four Membered Ring Fragments
C-8 Five Membered Ring Fragments
C-9 Six Membered Ring Fragments
C-10 Greater than Six Membered Ring Fragments
C-11 Fused Ring Fragments

LIST OF FIGURES

2-1 Drug Development Process
2-2 The Iterative Drug Design Process
2-3 Thermodynamic Cycle of Receptor-Ligand Binding
2-4 Computational Component of Drug Design
2-5 Hierarchy of QM methods used in SBDD
2-6 NMR QSAR
2-7 The Classic "Pac-man" Representation of Receptor-Ligand Binding
2-8 PWD Density Matrix Representation
2-9 Schematic Diagram of the Human Carbonic Anhydrase II inhibitor Fragmentation
2-10 SE-COMBINE Descriptor Table
2-11 Schematic Diagram of a Trypsin Inhibitor Fragmentation
2-12 SE-COMBINE Intermolecular Interaction Map (IMM)
2-13 Graph Theory I
2-14 Graph Theory II
2-15 Principal Component Analysis (PCA) Schematic Diagram of the Matrices and Vectors Involved
2-16 Partial Least Squares (PLS) Schematic Diagram of the Matrices and Vectors Involved
2-17 Most Common Amino Acid Residues which Bond to Metal Ions
2-18 Zinc Metalloproteins
2-19 Copper Metalloproteins
2-20 Homo-Nuclear Metalloproteins
2-21 Hetero-Nuclear Metalloproteins
3-1 Computational Drug Design
3-2 Library Hierarchy as Implemented in MTK++
3-3 Core Class Hierarchy of the Molecule Library as Implemented in MTK++
3-4 Class Hierarchy of the Parameters Component of the Molecule Class as Implemented in MTK++
3-5 Class Hierarchy of the Standard Library Component of the Molecule Class as Implemented in MTK++
3-6 Disulfide Bond in Proteins
3-7 The Structural Types of the Histidine Residue
3-8 Class Hierarchy of the Molecule Component of the Molecule Class as Implemented in MTK++
3-9 Class Hierarchy of the Graph Library as Implemented in MTK++
3-10 Class Hierarchy of the MM Library as Implemented in MTK++
3-11 Class Hierarchy of the GA Library as Implemented in MTK++
3-12 Class Hierarchy of the Statistics Library as Implemented in MTK++
3-13 Class Hierarchy of the Parsers Library as Implemented in MTK++
3-14 Hybridization, Bond Order, and Formal Charge Perception Using the Labute Algorithm
3-15 Ring Perception
3-16 Ring Perception Contd.
3-17 Aromatic, Non-aromatic, and Anti-aromatic Rings
3-18 Hydrogen Bond
3-19 Rotatable Bond Types
3-20 Systematic Conformational Searching
3-21 Conformer Generation
3-22 Ullman Subgraph Isomorphism Illustration
3-23 Clique Detection Illustration
3-24 Molecular Superposition
4-1 Carboxypeptidase A Ligands
4-2 1CBX Conformer Analysis
4-3 Carboxypeptidase A Alignment Results
4-4 Glycogen Phosphorylase Ligands
4-5 Glycogen Phosphorylase Alignment Results
4-6 Immunoglobin Ligands
4-7 Immunoglobin Alignment Results
4-8 Streptavidin Ligands
4-9 Streptavidin Alignment Results
4-10 Dihydrofolate Reductase Ligands
4-11 Trypsin Inhibitors
4-12 Trypsin Alignment Results
4-13 Estrogen Receptor Ligands
4-14 Peroxisome Proliferator-Activated Receptor γ Agonists
4-15 HCA II Ligands
4-16 Thrombin Inhibitors
4-17 Elastase Ligands
4-18 Elastase Alignment Results
4-19 Thermolysin Inhibitors
4-20 Thermolysin Alignment Results
5-1 Approaches to Incorporate Metal Atoms into Molecular Mechanics Force Fields
5-2 MCPB Flow Diagram
5-3 Metal Ligand Geometries Perceived Using Harding's Rules
5-4 Zinc Coordination Geometry Distribution from the PDB
5-5 The Most Common Tetrahedral Zinc Coordinating Ligands Combination Distribution
5-6 Zn-S Bond Length Distributions in CCCC, CCCH, CCHH, and CHHH Tetrahedral Environments
5-7 Box Plots of Zn-S/N Bond Lengths in CCCC, CCCH, CCHH, CHHH, and HHHH Environments
5-8 Tetrahedral Zn-O(Asp/Glu) and Zn-N(His) Bond Length Distributions
5-9 ZAFF Flow Diagram
5-10 Zn-CCCC Cluster Models (PDB ID: 1A5T)
5-11 Zn-CCCH Cluster Models (PDB ID: 1A73 and 2GIV)
5-12 Zn-CCHH Cluster Models (PDB ID: 1A1F)
5-13 Zn-CHHH Cluster Models (PDB ID: 1CK7)
5-14 Zn-HHHH Cluster Models (PDB ID: 1PB0)
5-15 Correlation between Zn-S and Zn-N Bond Lengths and Calculated Force Constants through the Series CCCC, CCCH, CCHH, CHHH, and HHHH
5-16 Zn-HHHO Cluster Models (PDB ID: 1CA2)
5-17 Zn-HHOO Cluster Models (PDB ID: 1VLI)
5-18 Zn-HOOO Cluster Models (PDB ID: 1L3F)
5-19 Zn-HHHD and Zn-HHDD Cluster Models (PDB ID: 2USN and 1U0A)

LIST OF ABBREVIATIONS

PDB  Protein Data Bank
DD  Drug Design
NDA  New Drug Application
IND  Investigational New Drug
FDA  Food and Drug Administration
ADME  Absorption, Distribution, Metabolism, and Excretion
SBDD  Structure-Based Drug Design
LBDD  Ligand-Based Drug Design
MM  Molecular Mechanics
QM  Quantum Mechanics
HF  Hartree-Fock
DFT  Density Functional Theory
SE  SemiEmpirical
MNDO  Modified Neglect of Diatomic Overlap
AM1  Austin Model 1
PM3  Parametric Model 3
PDDG/PM3  Pairwise Distance Directed Gaussian modification of PM3
SCC-DFTB  Self-Consistent-Charge Density-Functional Tight-Binding
RBDD  Receptor-Based Drug Design
QSAR  Quantitative Structure Activity Relationship
MLR  Multiple Linear Regression
PCR  Principal Component Regression
PLSR  Partial Least Squares Regression
CNNs  Computer Neural Networks
HOMO  Highest Occupied Molecular Orbital
LUMO  Lowest Unoccupied Molecular Orbital
CODESSA  COmprehensive DEscriptors for Structural and Statistical Analysis
CoMFA  Comparative Molecular Field Analysis
CoMSIA  Comparative Molecular Similarity Indices Analysis
PLS  Partial Least Squares
PIE  Probe Interaction Energy
QSM  Quantum Similarity Measure
CSI  Carbó Similarity Index
QQSAR  Quantum QSAR
QSSA  Quantum Similarity Superposition Algorithm
QTMS  Quantum Topological Molecular Similarity
BCPs  Bond Critical Points
AIM  Atoms-In-Molecules
DnC  Divide-and-Conquer
NDDO  Neglect of Diatomic Differential Overlap
SASA  Solvent Accessible Surface Area
PWD  Pairwise Energy Decomposition
CNDO  Complete Neglect of Differential Overlap
CM1  Charge Model 1
CM2  Charge Model 2
RESP  Restrained ElectroStatic Potential
MK  Merz-Singh-Kollman
COMBINE  Comparative Binding Energy Analysis
SE-COMBINE  SemiEmpirical-Comparative Binding Energy Analysis
IMM  InterMolecular interaction Map
LOO  Leave-One-Out
PRESS  Predicted Residual Sum of Squares
SDEC  Standard Deviation of Error of Calculations
SDEP  Standard Deviation of Error Prediction
RMSD  Root Mean Squared Deviation
PCA  Principal Component Analysis
PC  Principal Component
CYS  Cysteine
MET  Methionine
ASP  Aspartic Acid
GLU  Glutamic Acid
HIS  Histidine
HCAII  Human Carbonic Anhydrase II
MTK++  Modeling ToolKit++
API  Application Programming Interface
GA  Genetic Algorithm
BLAS  Basic Linear Algebra Subprograms
LAPACK  Linear Algebra PACKage
GAFF  Generalized AMBER Force Field
MEP  Molecular Electrostatic Potential
vdW  van der Waals
GFs  Gaussian Functions
RFO  Rational Function Optimization
RIPS  Random Incremental Pulse Search
BFGS  Broyden-Fletcher-Goldfarb-Shanno
SD  Steepest Descent
NR  Newton-Raphson
MCP  Maximum Common Pharmacophore
ASA  Atomic Shell Approximation
SA  Surface Area
MO  Molecular Orbital
DHFR  Dihydrofolate Reductase
PPARγ  Peroxisome Proliferator-Activated Receptor γ
ER  Estrogen Receptor
ESP  ElectroStatic Potential
UFF  Universal Force Field
CCSD  Crystallographic Structural Database
MCPB  Metal Center Parameter Builder

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

THE APPLICATION OF SEMIEMPIRICAL METHODS IN DRUG DESIGN

By

Martin B. Peters

August 2007

Chair: Kenneth M. Merz Jr.
Major: Chemistry

The application of quantum mechanical methods in de novo drug design is currently quite limited in both scope and utility. This thesis outlines where these methods fit in the drug design process and where they can be improved. Chapters one and two of this dissertation describe the drug development process and current methods used to calculate the free energy of receptor-ligand binding. Some of the computational tools used in drug design are discussed, such as scoring functions, molecular mechanics, quantum mechanics, semiempirical pair-wise energy decomposition, comparative binding energy analysis, the SE-COMBINE approach, and popular 3D-QSAR approaches.

The remaining chapters of this work describe the development and application of a package of computational chemistry C++ libraries called the Modeling ToolKit++ (MTK++). This toolkit was used to develop a new technique to superimpose drug-like molecules onto one another using a quantum mechanical score function. Obtaining the correct alignment of two molecules to reproduce the pose within a protein active site is a challenging problem. This new method was validated on almost 90 protein-ligand complexes for which x-ray crystallographic data were available.

MTK++ was also used to develop a generalized tetrahedral Zinc force field for metalloprotein molecular dynamics simulations. It is desirable to model metalloprotein systems using MM models because one can carry out simulations to address important structure/function and dynamics questions that are not currently attainable using QM and QM/MM based methods. Until now, force fields for metalloproteins were built by hand through a convoluted process. Automating this process in a computer program reduces the scope for human error. This program was used to build force fields for 10 Zinc tetrahedral active sites, which required the parameterization of bond and angle force constants and the calculation of partial charges. MTK++ was designed to automatically perceive metal centers and assign the parameters necessary to carry out MM or MD calculations.

CHAPTER 1
INTRODUCTION

Drug discovery has evolved from a serendipitous process into a more rational process of design. High-throughput screening, combinatorial chemistry, the human genome project, and computational methods have been developed to this end. Nonetheless, the cost of creating a drug has increased exponentially over the last 50 years [1] without a corresponding increase in the number of new drugs reaching the market. The most plausible reason is a lack of fundamental understanding of molecular recognition, binding, and ultimately drug delivery processes.

Computational medicinal chemistry spans a broad spectrum of disciplines including theoretical, computational, and structural chemistries. Theoretical chemistry involves the development of new and improved theories, whereas computational chemistry entails the application of established theoretical tools to chemical problems. Structural chemistry techniques such as X-ray crystallography and NMR spectroscopy have played a significant role in facilitating our understanding of molecular recognition and interaction. Although computational medicinal chemistry cannot design new drugs on its own, it has been shown that it can play a role in predicting binding free energies and geometries of receptor-ligand complexes. Examples include the development of the HIV protease inhibitor, saquinavir, by "in silico" design as a transition-state analogue [2] and the rational design of an Angiotensin-Converting Enzyme (ACE) inhibitor called captopril [3].

The computational techniques employed to aid the drug design process include virtual screening, docking, and scoring, with the results or "hits" utilized by medicinal chemists [4]. Computational methods vary in cost; screening can be carried out on large databases of compounds, while scoring and docking are generally carried out on a smaller number of structures. Screening attempts to predict physicochemical properties of molecules, such as aqueous solubility, and by doing so reduces the number of molecules with poor drug-like properties being synthesized. Docking is a technique for placing a drug candidate into the active site of a receptor. The docked pose of a ligand in the active site of a receptor can be scored using knowledge-, empirical-, or physics-based methods, with the latter being more expensive [5]. Computational methods also lend themselves to virtual combinatorial chemistry, which can be used to optimize the complementarity between a receptor and a ligand. Although this can also be done experimentally, the main advantage of the computational approach is the reduction of cost and time.

The computational prediction of binding free energies is still not an exact science. However, with current computer hardware and theoretical technologies, problems that were previously considered infeasible are now tractable. There are two areas where increased computational power can be used for the accurate prediction of binding free energies: increased sampling of conformational space, and interaction energy calculations using complete Hamiltonians [6]. The use of both will be investigated in this thesis.

This dissertation describes the application of quantum mechanical methods in bio- and medicinal chemistry. The following chapters describe the development of computational chemistry modeling software, the flexible alignment of drug-like molecules, and the generation of a Zinc force field for metalloprotein simulations and drug design applications.

In Chapter 2 the industrial drug design process is outlined to show why computational tools are used in drug design. The thermodynamic basis and current methods used to calculate the free energy of receptor-ligand binding are described. The equations of binding are derived in order to reflect the current understanding of binding, both experimentally and computationally. Some of the computational tools used in Structure Based Drug Design (SBDD) are discussed, including scoring functions [4], Molecular Mechanics (MM), Quantum Mechanics (QM), SemiEmpirical (SE) Pair-Wise energy Decomposition (PWD) [7], the Comparative Binding Energy Analysis (COMBINE) [8], and the SE-COMBINE [9] approaches. Popular 3D-QSAR approaches are also outlined, including CoMFA (Comparative Molecular Field Analysis) [10] and CoMSIA (Comparative Molecular Similarity Indices Analysis) [11, 12], along with two multivariate statistical tools: Principal Component Regression (PCR) [13] and Partial Least Squares (PLS) [14].

Chapter 3 outlines the design and development of the Modeling ToolKit++ (MTK++) package of C++ libraries for the use of QM methods in drug design. Algorithms for atomic hybridization and formal charge determination, bond order and ring perception, substructure searching, and clique detection are described in detail with numerous illustrations. The impetus for this work was to create a computational chemistry platform where QM methods could be conveniently incorporated in drug discovery applications. This work was fundamental to this thesis, and all modeling in later chapters used this package.

The fourth Chapter describes a method to flexibly align drug-like molecules onto one another using a semiempirical scoring function. The alignment of two bodies is a mathematical problem; however, the challenge is to reproduce the pose seen in x-ray crystallographic studies. Traditionally, molecular superposition has been carried out using empirical scoring functions due to their speed. The goal of this research was to investigate the applicability of semiempirical methods in molecular alignment; the method was validated against over 80 protein-ligand complexes from the Protein Data Bank (PDB) [15].

The fifth Chapter outlines the development of a molecular mechanics force field (FF) for tetrahedral Zinc metalloproteins suitable for the AMBER suite of programs [16]. Several issues regarding the modeling of metalloproteins were addressed. The first goal was to develop software to conveniently handle metalloprotein structures. The program MCPB (Metal Center Parameter Builder) was created to build and validate metalloprotein FFs for use in molecular simulations to study structure, function, and dynamics. Secondly, the automated perception of metal centers in proteins was undertaken and gave rise to the program called pdbSearcher. This software was used to survey the PDB for all Zinc containing metalloproteins. The most abundant primary shell combinations bound to Zn atoms were extracted, FFs were generated for them, and the resulting parameters were analyzed in detail.

Finally, Chapter 6 provides a brief summary of the work presented in this dissertation.

I hope this dissertation demonstrates the utility of current quantum mechanical approaches in the areas of drug design and metalloprotein modeling. The use of quantum mechanical methods in drug design can be viewed as the final frontier because these methods describe molecular interactions from first principles [6]. Nevertheless, the use of quantum mechanics over classical approaches brings extra expense, and so it is necessary to show that these methods can be superior to simpler models.

CHAPTER 2
THEORY AND METHODS

Designing a drug (a molecule which affects biological processes without causing injury) requires numerous steps from its inception to its introduction into the market. This process takes approximately 10-15 years, as shown in Fig. 2-1, and can cost on the order of half a billion dollars. This is due to both the vastness of chemical space [17–21] and the cost of research and testing [1].

[Figure 2-1: timeline diagram. Stage labels: Basic Research; Preclinical Development; IND Submission; Clinical Development; NDA Submission. Activity labels: Compound Discovery; Safety Testing; IND Preparation; Phases I, II, III; NDA Preparation; FDA Review; Formulation Research; Process Development; Pharmacokinetics; Toxicology. Durations shown: 3-4 years, 1 year, 6-8 years, 2-3 years, ongoing. Compounds remaining: millions, 1000, 10, 1.]

Figure 2-1. Drug Development Process. Adapted from http://www.netsci.org/. An IND (Investigational New Drug) is prepared and submitted to the FDA (Food and Drug Administration) at the end of the preclinical phase of drug development. With good results from the clinical phase an NDA (New Drug Application) is submitted to the FDA for approval to release the drug to the general public.

The pre-clinical phase of Drug Design (DD) is carried out using an iterative process (first three columns of Fig. 2-1). It starts with some knowledge of a target, i.e., a known natural substrate or a crystal/NMR structure of the target or receptor. The target is chosen based on some known chemical feature of a biological disease. The design cycle takes many steps, such as computational design, ligand design, synthesis, biochemical evaluation, and crystallography, converging to a drug candidate or lead as shown in Fig. 2-2 [22]. During each cycle of this process different computational tools are used with varying costs and accuracies, which will be discussed in more detail later in this chapter.

At the end of the pre-clinical phase an IND (Investigational New Drug) is prepared to allow a company to test the drug in humans. After the IND is approved by the FDA (Food and Drug Administration) the clinical stage (4th column of Fig. 2-1) begins. Phase I of the clinical trials tests the toxicity, pharmacokinetics or ADME (Absorption, Distribution, Metabolism, and Excretion) properties, and dosage on approximately 50 healthy volunteers. Phase II evaluates the drug's effectiveness and side effects on volunteer patients (≈500), and it is at this stage that most adverse effects of the drug's use are observed. The final phase, Phase III, of clinical trials determines the effects of long term use on a large pool of volunteer patients. After phase III a company prepares an NDA (New Drug Application) and submits it to the FDA for approval to release the drug to the general public. The NDA contains the results of all clinical studies, and once it is approved by the FDA the drug can be marketed. After release the company carries out post-marketing surveillance of the drug's effectiveness in a so-called phase IV.

[Figure 2-2: cycle diagram. Node labels: Target Information; Computational Ligand Design; Synthesis; Biochemical Testing; Crystallographic Analysis; Drug Lead.]

Figure 2-2. The Iterative Drug Design Process. Adapted from Babine and Bender [22].

Drug targets or receptors include enzymes, ion channels, nuclear hormone receptors, and DNA, which interact with endogenous physiological substances such as hormones and neurotransmitters. There are currently over 1200 drugs approved by the FDA for therapeutic use in the United States, 25% of which target enzymes [23]. The majority of enzyme-targeted drugs are enzyme-substrate based and most act via non-covalent interactions.

Drugs that mimic the effects of endogenous regulatory compounds are called agonists, while compounds that elicit less than 100% activity are termed partial agonists [24]. Drugs that bind to receptors but have no activity and prevent endogenous compounds from binding are termed antagonists or inhibitors. There are two main types of enzymatic inhibition: reversible and irreversible. Reversible inhibition occurs through competitive, noncompetitive, and uncompetitive mechanisms. Diuretics used to control blood pressure and many anti-depressive agents, for example antagonists of dopamine receptors, are reversible competitive inhibitors. These drugs compete for the same binding site as the natural substrate, but the enzyme cannot process the inhibitor, thus preventing catalytic activity. Non-competitive or allosteric inhibitors bind to different regions of the enzyme and do not compete for the binding site. However, the process of binding the inhibitor can change the shape of the active site, thus preventing catalytic activity. Uncompetitive inhibition takes place when the inhibitor binds only the enzyme-substrate complex, consequently preventing catalysis. Irreversible inhibition occurs when the inhibitor covalently attaches to the enzyme active site, as with inhibitors of Carbonic Anhydrase [25].

Structural determination of receptors or complexes is often carried out by x-ray crystallography. It should be noted that the atomic positions from crystallography have an associated error, generally on the order of 1/6 of the resolution (≈0.4 Å uncertainty for a 2.4 Å resolution structure) [22].

A fundamental understanding of the interactions between receptors and ligands is necessary for the design of new drugs. These forces include ionic or electrostatic effects, ion-dipole and dipole-dipole interactions, charge transfer, van der Waals, and hydrophobic interactions. Molecules with high biological activity usually possess a shape that is complementary to that of the receptor's active site (hydrophobic, electrostatic, and polar contacts are paired upon binding), as first proposed by Fischer in the "lock-and-key" hypothesis.

2.1 Receptor-Ligand Binding Free Energy

In the simplest case, receptor-ligand binding corresponds to a single ligand molecule forming a 1:1 complex with a receptor that contains only a single binding site, as shown in Eq. 2–1. R represents the receptor, L the ligand, and R·L the complex, where $k_1$ and $k_{-1}$ are the association and dissociation rate constants, respectively.

\[ R + L \underset{k_{-1}}{\overset{k_1}{\rightleftharpoons}} R \cdot L \tag{2–1} \]

At equilibrium, association of a receptor and ligand occurs at the same rate as dissociation and the equilibrium constants, Ka and Kd, can be defined as:

\[ K_a = \frac{[RL]}{[R][L]} = \frac{1}{K_d} \tag{2–2} \]
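For the single-site model, Eq. 2–2 implies that the fraction of occupied receptor sites is [RL]/([R] + [RL]) = [L]/(Kd + [L]). The following is a minimal Python sketch of that relation; the function name and the 1 nM dissociation constant are illustrative assumptions, not values from this work.

```python
def fraction_bound(free_ligand_conc: float, kd: float) -> float:
    """Fraction of receptor sites occupied for 1:1 binding (from Eq. 2-2).

    [RL] / ([R] + [RL]) = [L] / (Kd + [L]); both arguments use the same units.
    """
    if free_ligand_conc < 0.0 or kd <= 0.0:
        raise ValueError("concentration must be >= 0 and Kd must be > 0")
    return free_ligand_conc / (kd + free_ligand_conc)


if __name__ == "__main__":
    kd = 1.0e-9  # hypothetical 1 nM dissociation constant (assumption)
    for conc in (1.0e-10, 1.0e-9, 1.0e-8):
        print(f"[L] = {conc:.1e} M -> occupancy = {fraction_bound(conc, kd):.3f}")
```

When the free ligand concentration equals Kd the function returns one half, which is the operational meaning of Kd.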

It is common practice to use Kd because it has units of concentration: Kd is the concentration of free ligand at which half of the receptor binding sites are occupied at equilibrium. Small values of Kd correspond to a high affinity between the receptor and ligand.

To gain a fundamental understanding of receptor-ligand binding one must begin with a thermodynamic description. The Gibbs free energy is most often used in biochemistry because binding experiments are carried out under conditions of constant temperature, pressure, and number of particles. In Eq. 2–3, ∆G is the free energy change for the reaction, ∆H and ∆S are the enthalpy and entropy changes, respectively, and T is the temperature.

\[ \Delta G = \Delta H - T \Delta S \tag{2–3} \]

The change in free energy can be expressed in terms of the equilibrium Kd as follows:

\[ \Delta G = \Delta G^{\circ} - RT \ln K_d \tag{2–4} \]

where $\Delta G^{\circ}$ is the standard state (1 M, 1 bar) free energy change and R is the gas constant. When complex association and dissociation reach equilibrium, $\Delta G = 0$, and the expression takes the form:

\[ \Delta G^{\circ} = RT \ln K_d \tag{2–5} \]
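Eq. 2–5 is easy to evaluate numerically. The sketch below assumes a temperature of 298.15 K and kcal/mol units; the function names and the example Kd are illustrative choices rather than part of the original text.

```python
import math

R_KCAL_PER_MOL_K = 1.987204e-3  # gas constant in kcal/(mol*K)


def standard_binding_free_energy(kd_molar: float, temperature_k: float = 298.15) -> float:
    """Eq. 2-5: dG0 = RT ln(Kd), with Kd expressed relative to the 1 M standard state."""
    if kd_molar <= 0.0:
        raise ValueError("Kd must be positive")
    return R_KCAL_PER_MOL_K * temperature_k * math.log(kd_molar)


def kd_fold_change(delta_delta_g: float, temperature_k: float = 298.15) -> float:
    """Ratio of two Kd values whose standard free energies differ by delta_delta_g (kcal/mol)."""
    return math.exp(delta_delta_g / (R_KCAL_PER_MOL_K * temperature_k))


if __name__ == "__main__":
    print(f"Kd = 1 nM  -> dG0 = {standard_binding_free_energy(1.0e-9):.2f} kcal/mol")
    print(f"5 kcal/mol -> {math.log10(kd_fold_change(5.0)):.2f} orders of magnitude in Kd")
```

Under these assumptions a 1 nM binder has a standard binding free energy of roughly −12.3 kcal/mol, and each order of magnitude in Kd costs about 1.4 kcal/mol.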

Since free energy is a state function it can be calculated and compared with experimental values. The free energy of binding, $\Delta G_{bind}$, is calculated by determining the free energy of the reactants, $(\Delta G_R + \Delta G_L)$, and products, $\Delta G_{RL}$, separately. The superscript “◦” is dropped from the remaining equations for simplicity; however, it is implied.

\[ \Delta G_{bind} = \Delta G_{RL} - (\Delta G_R + \Delta G_L) \tag{2–6} \]

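\[
\begin{array}{ccccc}
R_{gas} & + & L_{gas} & \xrightarrow{\;\Delta G^{gas}_{bind}\;} & RL_{gas} \\
\big\downarrow \Delta G^{R}_{solv} & & \big\downarrow \Delta G^{L}_{solv} & & \big\downarrow \Delta G^{RL}_{solv} \\
R_{solv} & + & L_{solv} & \xrightarrow{\;\Delta G^{solv}_{bind}\;} & RL_{solv}
\end{array}
\]

Figure 2-3. Thermodynamic Cycle of Receptor-Ligand Binding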

Using the thermodynamic cycle in Fig. 2-3 and Eq. 2–6, the free energy of binding in solution, $\Delta G^{solv}_{bind}$, can be fully decomposed as in Eq. 2–7 [26]. $\Delta G^{gas}_{bind}$ is the free energy of complexation in the gas phase. This term is dominated by the enthalpic contributions from steric and electrostatic interactions. $\Delta\Delta G_{solv}$ is the solvation free energy of complexation, which incorporates the desolvation of the ligand, $\Delta G^{L}_{solv}$, receptor, $\Delta G^{R}_{solv}$, and complex, $\Delta G^{RL}_{solv}$.

\[ \Delta G^{solv}_{bind} = \Delta G^{gas}_{bind} + \Delta\Delta G_{solv} \tag{2–7} \]

where $\Delta G^{gas}_{bind}$ and $\Delta\Delta G_{solv}$ are defined by equations 2–8 and 2–9.

\[ \Delta G^{gas}_{bind} = \Delta H^{gas}_{bind} - T \Delta S^{gas}_{bind} \tag{2–8} \]

\[ \Delta\Delta G_{solv} = \Delta G^{RL}_{solv} - \Delta G^{R}_{solv} - \Delta G^{L}_{solv} \tag{2–9} \]
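The bookkeeping in Eqs. 2–7 to 2–9 can be sketched in a few lines of Python. The class name, field names, and all numerical values below are hypothetical and serve only to show how the terms combine; they are not results from this work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BindingTerms:
    """Inputs to the thermodynamic cycle (kcal/mol, except entropy in kcal/(mol*K))."""
    dh_gas: float           # gas-phase enthalpy of complexation
    ds_gas: float           # gas-phase entropy of complexation
    g_solv_complex: float   # solvation free energy of the complex RL
    g_solv_receptor: float  # solvation free energy of the receptor R
    g_solv_ligand: float    # solvation free energy of the ligand L
    temperature_k: float = 300.0


def binding_free_energy_in_solution(t: BindingTerms) -> float:
    dg_gas = t.dh_gas - t.temperature_k * t.ds_gas                        # Eq. 2-8
    ddg_solv = t.g_solv_complex - t.g_solv_receptor - t.g_solv_ligand     # Eq. 2-9
    return dg_gas + ddg_solv                                              # Eq. 2-7


if __name__ == "__main__":
    # Hypothetical numbers chosen only to illustrate the cancellation of large terms.
    terms = BindingTerms(dh_gas=-60.0, ds_gas=-0.05,
                         g_solv_complex=-250.0, g_solv_receptor=-270.0,
                         g_solv_ligand=-15.0)
    print(f"dG_bind(solv) = {binding_free_energy_in_solution(terms):.1f} kcal/mol")
```

The example illustrates a typical pattern: a large favorable gas-phase term is largely offset by the desolvation penalty, leaving a much smaller net binding free energy.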

For tight-binding ligands, the interactions in the complex are significantly stronger than those of the receptor and ligand alone in solution. The favorable enthalpic interactions must also compensate for the entropic loss of conformational degrees of freedom of both the receptor and ligand, plus three rotational and three translational degrees of freedom. It should be noted that small variations in a complex’s stability (∆G) in kcal/mol correspond to large differences in affinity (Kd). For example, a difference of 5 kcal/mol corresponds to more than three orders of magnitude variation in observed affinity at room temperature.

2.2 Computational Drug Design

The computational components of the drug design process take place during the initial stages of each iterative cycle, as shown in Fig. 2-2; the main reasons for their use are to reduce costs and to provide atomic-level insight into receptor-ligand interactions. This area can be broken down into two parts: Structure-Based Drug Design and Ligand-Based Drug Design. The former requires structural knowledge of the receptor while the latter does not; both are discussed in detail below.

The early iterations of the drug design process involve searching or screening databases of molecules such as ZINC [27, 28] and other combinatorial libraries [29] for compounds that may be active against the target [30–32], thus separating drugs from non-drugs [33–35]. Screening can involve similarity/dissimilarity searching [36] against a known active/inactive molecule. Compounds can be compared to each other in 1D [37], 2D [38], or 3D [39, 40], with the last technique being the most expensive. Simple counting techniques [41] such as Lipinski’s “rule-of-five” [42] are also used to filter out non-drug molecules. Screens are used to predict ADME properties [43] such as aqueous solubility [44], hepatotoxicity [45], P450 inhibition [46], and absorption [47]. Screens are also carried out to predict the synthetic accessibility of compounds, thus allowing for later functional group optimization [48]. Subsequent “hits” from a screen serve as lead compounds for medicinal chemists.
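A counting filter of the rule-of-five type can be sketched as follows. The thresholds are the commonly quoted ones (molecular weight at most 500 Da, logP at most 5, at most 5 hydrogen-bond donors, at most 10 hydrogen-bond acceptors); the class, the tolerance of one violation, and the two example compounds are assumptions made for illustration, and the descriptor values are supplied by hand rather than computed from structures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Descriptors:
    """Precomputed descriptors for one compound (values supplied by the caller)."""
    name: str
    molecular_weight: float  # Da
    log_p: float             # octanol-water partition coefficient
    h_bond_donors: int
    h_bond_acceptors: int


def rule_of_five_violations(d: Descriptors) -> int:
    """Count how many of the four rule-of-five criteria are violated."""
    return sum((
        d.molecular_weight > 500.0,
        d.log_p > 5.0,
        d.h_bond_donors > 5,
        d.h_bond_acceptors > 10,
    ))


def passes_filter(d: Descriptors, max_violations: int = 1) -> bool:
    """Keep compounds with at most max_violations violations (assumed tolerance)."""
    return rule_of_five_violations(d) <= max_violations


if __name__ == "__main__":
    library = [
        Descriptors("hypothetical_small", 320.4, 2.1, 2, 5),
        Descriptors("hypothetical_large", 812.9, 6.3, 7, 14),
    ]
    for compound in library:
        print(compound.name, rule_of_five_violations(compound), passes_filter(compound))
```

Because the filter only counts threshold violations, it costs almost nothing per compound, which is why such screens sit at the top of the funnel.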

De novo drug design [49–51] is another tool used to identify novel lead compounds. This technique “grows” molecules in the active site of a receptor, or in a pseudoreceptor derived from the alignment of known active molecules [52].

Target
→ Database Screening (millions of compounds; milliseconds)
→ Docking/Scoring (1000s of compounds; seconds/hours)
→ Lead Optimization (100s of compounds; seconds/hours)
→ Drug Candidate

Figure 2-4. Computational Component of Drug Design. Timings are per compound.

Lead, or “drug-like”, compounds are expected to have good pharmacokinetics and be accessible to synthetic modification. The transition from a lead compound to a drug candidate involves optimizing structural and chemical complementarity with the receptor. Docking and scoring are tools to measure the complementarity between lead and receptor [4]. Docking is the process by which a ligand structure is placed in the active site of the receptor, while scoring predicts the binding free energy of complex formation. Lead optimization is often used to optimize the pharmacokinetics through functional group substitution. A schematic of the computational aspect of drug design is shown in Fig. 2-4. This is drawn as a funnel to highlight that the number of compounds decreases from top to bottom, while the expense of the computational tools used most often increases.

Various approaches have emerged to calculate or predict the binding free energy, with varying degrees of success. They include physics-, empirical-, and knowledge-based scoring functions [5, 53–57], and various QSAR approaches [10, 11]. The results of empirical and knowledge-based scoring functions are highly dependent on parameterization, and the calculation of binding free energies of compounds unlike those in the training set can yield spurious results. Physics-based scoring functions try to model each component of Eq. 2–7 from first principles. Physics-based techniques and QSAR approaches are introduced in the following sections, and their advantages and disadvantages in determining the free energy of binding are discussed.

2.3 Molecular Mechanics

Molecular Mechanics (MM) force fields such as AMBER [16, 58–60], CHARMM [61], MMFF [62–69], OPLS [70], and MM3 [71] can be used to calculate the enthalpic component of the binding free energy between the receptor and ligand. The AMBER energy function, Eq. 2–10, contains bond, angle, dihedral, and non-bonded terms. The bond and angle terms are represented by harmonic expressions. The van der Waals term is a 6-12 potential, and the electrostatic term is expressed as a Coulombic interaction with atom-centered point charges.

\[
E_{total} = \sum_{bonds} K_r (r - r_{eq})^2 + \sum_{angles} K_\theta (\theta - \theta_{eq})^2 + \sum_{dihedrals} \frac{V_n}{2} \left[ 1 + \cos(n\phi - \gamma) \right] + \sum_{i<j} \left[ \frac{A_{ij}}{r_{ij}^{12}} - \frac{B_{ij}}{r_{ij}^{6}} + \frac{q_i q_j}{\varepsilon r_{ij}} \right] \tag{2–10}
\]

A truncated Fourier series represents the dihedral term, where Vn is the barrier height, n is the periodicity, φ is the calculated dihedral angle and γ is the phase difference.
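The individual terms of Eq. 2–10 can be written as small Python functions. This is a minimal sketch, not the AMBER implementation: the function names are hypothetical, the non-bonded parameters use the $A_{ij}$ and $B_{ij}$ combination rules given in this section, and the Coulomb conversion constant (about 332.06 kcal Å mol⁻¹ e⁻²) is an assumption, since force fields differ in its later digits and Eq. 2–10 leaves the unit conversion implicit.

```python
import math

COULOMB_KCAL = 332.0636  # kcal*Angstrom/(mol*e^2); assumed conversion constant


def bond_energy(k_r: float, r: float, r_eq: float) -> float:
    """Harmonic bond term K_r (r - r_eq)^2."""
    return k_r * (r - r_eq) ** 2


def angle_energy(k_theta: float, theta: float, theta_eq: float) -> float:
    """Harmonic angle term K_theta (theta - theta_eq)^2, angles in radians."""
    return k_theta * (theta - theta_eq) ** 2


def dihedral_energy(v_n: float, n: int, phi: float, gamma: float) -> float:
    """Truncated Fourier term (V_n / 2) [1 + cos(n phi - gamma)], angles in radians."""
    return 0.5 * v_n * (1.0 + math.cos(n * phi - gamma))


def nonbonded_energy(r_ij: float, rstar_i: float, rstar_j: float,
                     eps_i: float, eps_j: float,
                     q_i: float, q_j: float, dielectric: float = 1.0) -> float:
    """6-12 Lennard-Jones plus Coulomb term for one atom pair (distances in Angstrom)."""
    eps_ij = math.sqrt(eps_i * eps_j)   # combined well depth
    rstar_ij = rstar_i + rstar_j        # combined van der Waals radius
    a_ij = eps_ij * rstar_ij ** 12
    b_ij = 2.0 * eps_ij * rstar_ij ** 6
    lennard_jones = a_ij / r_ij ** 12 - b_ij / r_ij ** 6
    coulomb = COULOMB_KCAL * q_i * q_j / (dielectric * r_ij)
    return lennard_jones + coulomb


if __name__ == "__main__":
    # Two neutral atoms at their combined vdW distance sit at the bottom of the well.
    print(nonbonded_energy(3.0, 1.5, 1.5, 0.1, 0.1, 0.0, 0.0))
```

With this parameterization the Lennard-Jones minimum lies at the combined radius and has a depth equal to the combined well depth, which is a convenient check on any implementation.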

The fourth term describes the steric interaction as a Lennard-Jones potential, where $r_{ij}$ is the distance between atoms i and j. $A_{ij} = \varepsilon_{ij} (r^*_{ij})^{12}$ and $B_{ij} = 2\varepsilon_{ij} (r^*_{ij})^{6}$ are parameters that define the shape of the potential, where $r^*_{ij} = r^*_i + r^*_j$ in Å, $r^*_i$ is the van der Waals radius for atom i, $\varepsilon_{ij} = \sqrt{\varepsilon_i \varepsilon_j}$, and $\varepsilon_i$ is the van der Waals well depth in kcal/mol. In the final term, $q_i$ and $q_j$ are the atom-centered point charges. A rigorous derivation of the gradients of the AMBER function is given in Appendix B.

2.4 Quantum Mechanics

Higher-order molecular interactions such as polarization and charge transfer are neglected in molecular mechanics force fields because of their point-charge-based approaches. Quantum mechanical techniques intrinsically include such interactions. The high computational cost of ab initio methods such as Hartree-Fock (HF) and Density Functional Theory (DFT) restricts their use to small systems such as organic molecules, protein active sites, and metal clusters. Thanks to the work of Pople, Dewar, and Stewart, amongst others, the Roothaan-Hall equations have been approximated and parameterized to give a series of so-called SemiEmpirical (SE) methods. The most commonly used SE methods are derived from the MNDO (Modified Neglect of Diatomic Overlap) [72] method, including AM1 (Austin Model 1) [73], PM3 (Parametric Model 3) [74, 75], MNDO/d (MNDO with d orbitals) [76], and PDDG/PM3 (Pairwise Distance Directed Gaussian modification of PM3) [77]. Recently, DFT methods have been approximated, creating the SCC-DFTB (Self-Consistent-Charge Density-Functional Tight-Binding) method [78, 79]. The SCC-DFTB approach has been compared to the traditional SE methods, AM1 and PM3, and gave comparable errors in predicting heats of formation for a set of 622 neutral molecules; however, its errors were higher than those from the PDDG/PM3 method [80].

SE methods can be used to calculate the total electrostatic energy of a molecular system, which is the sum of the electronic energy, $E_{el}$, and core-core repulsion, $E_{core-core}$:

\[ E_{tot} = E_{el} + E_{core-core} \tag{2–11} \]

where $E_{el}$ and $E_{core-core}$ are described in equations 2–12 and 2–13. In these equations H is the one-electron matrix, F is the Fock matrix, and P represents the density matrix. Z is the nuclear charge on the atom, $R_{AB}$ is the atomic separation between A and B, and N is the total number of atoms.

\[ E_{el} = \frac{1}{2} \sum_{\mu} \sum_{\nu} (H_{\mu\nu} + F_{\mu\nu}) P_{\mu\nu} \tag{2–12} \]

\[ E_{core-core} = \sum_{A=1}^{N} \sum_{B>A}^{N} \frac{Z_A Z_B}{R_{AB}} \tag{2–13} \]

The use of QM in SBDD can be divided into two broad categories, receptor-based and ligand-based methods (Fig. 2-5). Receptor-Based Drug Design (RBDD) methods include scoring-, QM/MM-, and comparative binding energy (COMBINE)-type methods. RBDD requires either an X-ray crystal or NMR structure of ligands in complex with the relevant receptor. Ligand-based drug design techniques include various Quantitative Structure-Activity Relationship (QSAR) methods, which rely only on knowledge of the ligand structure. In general, QSAR can be conducted using two-dimensional (2D) or three-dimensional (3D) structures; however, 3D structures must be used with QM because of the need for an all-atom description of the nuclei and associated electrons [6].

2.5 Ligand Based Drug Design

One of the oldest tools used in rational drug design is QSAR (Quantitative Structure Activity Relationship) [81]. QSAR models are derived for a set of compounds with dependent variables (activity values, e.g. Ki, IC50) and a set of calculated molecular properties, or independent variables, called descriptors. Each compound in the data set is assumed to be in its active conformation. Models are generated using statistical techniques such as Multiple Linear Regression (MLR), Principal Component Regression (PCR) [13], Partial Least Squares Regression (PLSR) [82], and Computer Neural Networks (CNNs) [83], to name a few. Ligand-based methods can be further divided into two categories, 3D-QSAR and field-based methods. Both are touched on below.

QM + SBDD
  Ligand-Based
    Field-based (e.g. QM-QSAR)
    3D-QSAR (e.g. AMPAC+CODESSA)
  Receptor-Based
    Scoring (e.g. QMScore)
    QM/MM (e.g. DivCon/AMBER)
    COMBINE (e.g. SE-COMBINE)

Figure 2-5. Hierarchy of QM methods used in SBDD.

2.5.1 3D-QSAR with QM descriptors

The descriptors used in 3D-QSAR are usually divided into three categories: 1) Electronic, such as HOMO and LUMO energies, 2) Topological, for example connectivity indices, and 3) Geometric, such as moment of inertia. In all cases the models are typically created using multivariate statistical tools because of the large number and high degree of collinearity of the descriptors. An excellent review by Karelson, Lobanov, and Katritzky provides details of QM-based descriptors used in QSAR programs such as COmprehensive DEscriptors for Structural and Statistical Analysis (CODESSA) [84]. These include those that can be observed experimentally, such as dipole moments, and those that cannot, such as partial atomic charges. Clark and co-workers have recently used AM1-based descriptors to distinguish between drugs and non-drugs and to understand the relationship between descriptors and their physical properties [85].

Most descriptors are calculated at the semiempirical level of theory using programs such as AMPAC or MOPAC. However, with computer speed increasing steadily, the use of ab initio and DFT methods is becoming increasingly common. These methods allow the descriptors to be calculated from first principles. Yang and co-workers examined various DFT-based descriptors to generate models for a series of protoporphyrinogen oxidase inhibitors. It was shown that the DFT-based model outperformed the PM3-based model [86].

2.5.2 Field-based Methods

CoMFA (Comparative Molecular Field Analysis) [10] and CoMSIA (Comparative Molecular Similarity Indices Analysis) [11, 12] are field-based or grid-based methods in which all the compounds in the data set are aligned on top of one another and steric and electrostatic descriptors are calculated at each grid point using a probe atom. As a result there are many more descriptors than molecules; therefore, a Partial Least Squares (PLS) data analysis is used to generate linear equations. A study by Weaver and co-workers compared different field-based methods for QSAR, including CoMFA and CoMSIA, and found that field-based methods provide a robust tool to aid medicinal chemists [87].

Absent from the traditional MFA approaches are quantum mechanically derived descriptors of electronic structure. QMQSAR is a relatively new technique in which semiempirical QM methods are used to develop quantum molecular field-based QSAR models [88]. Placing the aligned training set ligands into a finely spaced grid produces quantum molecular fields, where each ligand is characterized by a set of Probe Interaction Energy (PIE) values. A PIE is defined as the “electrostatic potential energy obtained by placing a positively charged carbon’s 2s orbital at a given grid point $g_i$ and summing the attractive and repulsive potentials experienced by that electron as it interacts with the field of the ligand L”:

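\[
PIE = -\langle s_i s_i | V(L) \rangle = -\int_{r_1} \chi^{*}_{s_i}(r_1)\, \chi_{s_i}(r_1) \left( \sum_{\alpha=1}^{N_{atoms}} \left[ \frac{z_\alpha}{|r_1 - r_\alpha|} - \sum_{\mu \in \alpha} \sum_{\mu' \in \alpha} P_{\mu\mu'} \int_{r_2} \frac{\chi^{*}_{\mu}(r_2)\, \chi_{\mu'}(r_2)}{|r_1 - r_2|}\, dr_2 \right] \right) dr_1 \tag{2–14}
\]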

The nuclear charge $z_\alpha$ is simply the number of valence electrons on atom α, and the notation $\mu \in \alpha$ indicates the set of valence atomic orbitals centered on atom α. Density matrix elements $P_{\mu\mu'}$ are given by the following sum over the occupied MOs:

\[ P_{\mu\mu'} = 2 \sum_{k=1}^{N_{occ}} c_{\mu k}\, c_{\mu' k} \tag{2–15} \]

When applied to data sets containing corticosteroids, endothelin antagonists, and serotonin antagonists, linear regression models were produced with predictability similar to that of various CoMFA models.

2.5.3 Spectroscopic 3-D QSAR

The Spectroscopic QSAR methods [89, 90] include EVA (vibrational frequencies) [91–98], EEVA (MO energies) [99–102], and CoSA (NMR chemical shifts) [103]. It is a requirement of 3-D QSAR that all compounds being studied have the same number of descriptors. However, none of the above techniques satisfies this requirement. The number of vibrational frequencies (3N − 6, or 3N − 5 if linear) and the number of MOs depend on the number of atoms, N, in a molecule, while the number of NMR chemical shifts depends on the number of atomic isotopes with NMR-active nuclei. A solution to this problem is to force the information onto a bound scale using a Gaussian smoothing technique, where the upper and lower limits of this scale are the same for all compounds in the data set. A Gaussian kernel, f(x), with a standard deviation of σ is placed over each calculated point (EVA, EEVA, or NMR chemical shift) as shown in Eq. 2–16. Summing the amplitudes of the overlaid Gaussian functions at intervals x along the defined range gives the descriptors for each molecule, $\hat{f}(x)$, as shown in Eq. 2–17. This process is illustrated in figures 2-6(a) through 2-6(c).

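\[ f(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\!\left( -\frac{(x - X_A)^2}{2\sigma^2} \right) \tag{2–16} \]

\[ \hat{f}(x) = \sum_{i}^{shifts} \alpha_i \exp\!\left( -\beta_i (x - X_i)^2 \right) \tag{2–17} \]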
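The smoothing step in Eqs. 2–16 and 2–17 can be sketched in plain Python. The function names, the 0 to 200 ppm range, and the example shift lists are illustrative assumptions; every kernel here uses the same σ, so that $\alpha_i = 1/(\sigma\sqrt{2\pi})$ and $\beta_i = 1/(2\sigma^2)$.

```python
import math


def gaussian_kernel(x: float, center: float, sigma: float) -> float:
    """Eq. 2-16: normalized Gaussian centered on one calculated value."""
    return math.exp(-((x - center) ** 2) / (2.0 * sigma ** 2)) / (sigma * math.sqrt(2.0 * math.pi))


def bounded_descriptors(values, lower: float, upper: float,
                        interval: float, sigma: float):
    """Eq. 2-17: sum the kernels at fixed intervals along a common bounded scale."""
    n_points = int(round((upper - lower) / interval)) + 1
    grid = [lower + k * interval for k in range(n_points)]
    return [sum(gaussian_kernel(x, v, sigma) for v in values) for x in grid]


if __name__ == "__main__":
    # Hypothetical 13C shifts (ppm) for two molecules with different numbers of carbons.
    molecule_a = [14.1, 22.7, 31.9, 128.5, 171.0]
    molecule_b = [20.8, 55.3, 113.9, 129.6, 130.2, 159.4, 196.7]
    desc_a = bounded_descriptors(molecule_a, 0.0, 200.0, 1.0, 10.0)
    desc_b = bounded_descriptors(molecule_b, 0.0, 200.0, 1.0, 10.0)
    print(len(desc_a), len(desc_b))  # identical lengths despite different inputs
```

Both molecules end up with descriptor vectors of the same length, which is the property the regression step needs; the choice of σ and interval controls how much neighboring peaks blur together.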

(a) Calculated NMR C13 Chemical Shifts (horizontal axis: ppm, 0 to 200)
(b) NMR C13 Chemical Shifts with Gaussian Kernels (interval = 1, sigma = 10; horizontal axis: ppm, 0 to 200)
(c) NMR C13 Chemical Shifts with Gaussian Kernels plus the spectrum projected onto a bound scale (BNMRS) (interval = 1, sigma = 10; vertical axis: density; horizontal axis: ppm, 0 to 200)

Figure 2-6. NMR QSAR. Calculated NMR spectra for a steroid molecule with Gaussian kernels placed at each shift followed by a bound scale projected from the spectrum.

These descriptors contain a wealth of structural information when we consider the physical basis of the methods. IR spectroscopy provides information concerning the presence of molecular functional groups, and NMR chemical shifts are highly dependent on substituent effects in a congeneric series of compounds. MO energies describe the electronic structure of the molecule, including the HOMO and LUMO energies that play an important role in the binding process.

The choice of theory used to calculate these descriptors depends on the number of compounds in the dataset and the accuracy that is required; all can be calculated using SE or ab initio methods. The QSAR results also depend on the choice of σ and x in the above equations.

These methods have provided predictive models for a number of data sets and have an advantage over the field-based methods because they are alignment-free; in other words, there is no need to superimpose the structures in the dataset. Asikainen and co-workers provided a comparison of these methods in a recent paper in which they studied estrogenic activity in a series of compounds [89].

2.5.4 Quantum QSAR and Molecular Quantum Similarity

The Carbó group has been involved in the development of the field of quantum QSAR and molecular quantum similarity since the 1980s [104]. The Quantum Similarity Measure (QSM) between any two molecules, A and B, can be calculated using the following:

\[ Z_{AB} = \langle \rho_A | \Omega | \rho_B \rangle = \int\!\!\int \rho_A(r_1)\, \Omega(r_1, r_2)\, \rho_B(r_2)\, dr_1\, dr_2 \tag{2–18} \]

where Ω is some positive definite operator (e.g. kinetic energy or Coulomb) and ρ is the electron density. The QSMs can be transformed into indices ranging between 0 and 1 using:

\[ r_{AB} = \frac{Z_{AB}}{\sqrt{Z_{AA} Z_{BB}}} \tag{2–19} \]

yielding the so-called Carbó Similarity Index (CSI). Calculating an array of QSMs or CSIs between all molecular pairs in some data set provides descriptors for Quantum QSAR (QQSAR) [105].
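Given a precomputed matrix of QSMs, the normalization in Eq. 2–19 is a one-liner per pair. The sketch below assumes the QSM matrix is supplied as a nested list; the function name and the 2×2 example values are hypothetical, and computing the QSMs themselves (Eq. 2–18) is outside its scope.

```python
import math


def carbo_similarity_indices(qsm):
    """Eq. 2-19: r_AB = Z_AB / sqrt(Z_AA * Z_BB) for every molecular pair.

    qsm is a square, symmetric matrix (nested lists) with positive self-similarities.
    """
    n = len(qsm)
    if any(len(row) != n for row in qsm):
        raise ValueError("QSM matrix must be square")
    return [[qsm[a][b] / math.sqrt(qsm[a][a] * qsm[b][b]) for b in range(n)]
            for a in range(n)]


if __name__ == "__main__":
    # Hypothetical similarity measures for two molecules.
    z = [[4.0, 2.0],
         [2.0, 9.0]]
    for row in carbo_similarity_indices(z):
        print([f"{value:.3f}" for value in row])
```

Each row of the resulting matrix can serve as the descriptor vector of one molecule, with the diagonal equal to one by construction.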

A drawback of the CoMFA-based methods is the need to superimpose the molecules in the training set. This is no easy task because of the many degrees of freedom (both rigid and internal motions). However, the alignment of the molecular structures in a common 3D framework provides a convenient method of determining which regions of the molecules impact activity and which regions can be developed to create new compounds with more favorable properties. Recently, QSMs have been used with a Lamarckian genetic algorithm called the Quantum Similarity Superposition Algorithm (QSSA) to superimpose the classic CoMFA data set [106]. The QSSA is performed in such a way as to maximize the molecular similarity and does not rely on atom typing as other empirical methods do.

Popelier and co-workers have coupled the Atoms-In-Molecules (AIM) theory of Bader with quantum molecular similarity to produce Quantum Topological Molecular Similarity (QTMS) [107]. It uses the so-called Bond Critical Points (BCPs) of predefined bonds in a series of molecules as descriptors, followed by multivariate statistical analysis. The series of compounds must have a common core for this method to remain computationally tractable. QTMS has been used to generate models to estimate the pKa values for a set of aliphatic carboxylic acids, anilines, and phenols [108].

2.6 Receptor Based Drug Design

The classic “Pac-Man” representation of receptor-ligand binding is shown in Fig. 2-7, where the receptor is depicted on the left subdivided into residues and on the right is a small molecule split into fragments. Proteins can be split using standard amino acid definitions, while ligand structures can be decomposed using functional group definitions.

The binding free energies between receptors and ligands can be calculated using classical and quantum mechanical methods. In most cases when QM is used in SBDD, a single snapshot of the complex is taken and the interaction energy is determined, because taking ensemble averages is expensive and time consuming. The scheme in Fig. 2-8 is a matrix or graphical comparison between classical and quantum mechanical methods in SBDD.

This scheme is divided into three parts. First, on the left is a large box that represents a receptor made up of smaller boxes or residues, I, such as amino acids or bases. The dark blue box represents how all the other residues in the receptor polarize that residue. Polarization is where the charges centered on each atom are allowed to relax in the field of all other charges. The lighter blue box symbolizes the charge transfer that can occur between residues in a receptor. The smaller box in the middle of the figure symbolizes a ligand, and the smaller boxes it contains are molecular fragments, J. The pink and yellow boxes can be described in a similar fashion to the boxes of the receptor.

Figure 2-7. The Classic “Pac-man” Representation of Receptor-Ligand Binding. The receptor is depicted on the left subdivided into residues and on the right is a small molecule split into fragments.

Figure 2-8. PWD Density Matrix Representation

The largest box on the right is the complex structure. Upon binding, both the residues, I, and the fragments, J, are allowed to relax in the presence of each other. The I residues are transformed from dark blue to mustard while the J fragments are changed from pink to brown. This is polarization; however, now it is caused by complex formation. Most classical potentials cannot model these effects; however, recently there have been some attempts to incorporate polarization into classical methods such as ff02 and AMOEBA [109]. Conversely, QM methods include these interactions implicitly. The I-K (blue to grey) and J-L (yellow to magenta) interactions originate from the effect binding has on the intramolecular interactions of the receptor or ligand. These interactions can include charge transfer and polarization. The I-J interactions (red box) are the most important. Both methods can calculate these, and they are only present in the complex structure. Classical potentials describe the Coulombic and van der Waals interactions, or electrostatic and dispersive effects, between the moieties. The QM methods go a step further to include the other higher-order effects such as polarization and charge transfer. This is where the QM methods begin to describe the physics of the system more completely; however, this does not necessarily suggest that they are more predictive!

2.7 Semiempirical Divide-And-Conquer Approach

Very few full quantum mechanical studies of whole proteins have been published [7, 9, 110], but with the increasing speed of computers and linear scaling Divide-and-Conquer (D&C) techniques it is now possible to include the whole protein [111–113]. The D&C method takes advantage of the local character of chemical interactions, which causes the magnitude of density matrix elements to decrease exponentially with distance. Through the use of cutoffs for the Fock and density matrices and D&C techniques, the “nearsightedness” of chemical interactions can be exploited without loss of accuracy [114]. The D&C method divides the molecular system into overlapping subsystems where each localized Roothaan-Hall equation can be solved separately:

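\[ F^{\alpha} C^{\alpha} = C^{\alpha} E^{\alpha} \tag{2–20} \]

where $F^{\alpha}$, $C^{\alpha}$, and $E^{\alpha}$ are the subsystem Fock, coefficient, and orbital energy matrices. The overlap matrix, S, in SE methods is set equal to the identity matrix due to the NDDO (Neglect of Diatomic Differential Overlap) approximation:

\[ (\mu_A \nu_B | \lambda_C \sigma_D) = (\mu_A \nu_A | \lambda_C \sigma_C)\, \delta_{AB}\, \delta_{CD} \tag{2–21} \]

where $\delta_{AB}$ is the Kronecker delta function:

\[ \delta_{AB} = \begin{cases} 1 & \text{if } A = B, \\ 0 & \text{otherwise.} \end{cases} \tag{2–22} \]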
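The Kronecker deltas in Eq. 2–21 act as a selection rule on the two-electron integrals: an integral survives only when the first two orbitals share an atom and the last two orbitals share an atom. The toy Python sketch below counts how many index quadruples survive for a given assignment of orbitals to atoms; the function names and the two-atom example are illustrative assumptions and ignore permutational symmetry.

```python
from itertools import product


def nddo_survives(atom_of, mu: int, nu: int, lam: int, sig: int) -> bool:
    """True when (mu nu | lam sig) is retained under NDDO (delta_AB * delta_CD = 1)."""
    return atom_of[mu] == atom_of[nu] and atom_of[lam] == atom_of[sig]


def count_surviving_integrals(atom_of):
    """Return (retained, total) counts over all orbital index quadruples."""
    n = len(atom_of)
    retained = sum(nddo_survives(atom_of, *quad)
                   for quad in product(range(n), repeat=4))
    return retained, n ** 4


if __name__ == "__main__":
    # Hypothetical system: two atoms carrying two valence orbitals each.
    atom_of = [0, 0, 1, 1]
    retained, total = count_surviving_integrals(atom_of)
    print(f"{retained} of {total} two-electron integrals retained")
```

For this toy system only a quarter of the quadruples are kept, and the retained fraction falls further as the number of atoms grows, which is one reason SE methods are so much cheaper than ab initio ones.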

The diagonalization of the global Fock matrix is the most expensive part of a standard SE calculation, in contrast to ab initio methods, where the two-center two-electron integral evaluation is the bottleneck. Subdividing the global Fock matrix in the D&C method replaces the global diagonalization with subsystem diagonalizations whose total cost, $n_{sub} (N^{\alpha})^3$, scales linearly with the number of subsystems, $n_{sub}$. The subsystem density matrices are used to assemble the global density matrix, and the total energy is calculated using Equation 2–11.

The subsetting scheme in D&C methods is the key to its efficiency. Usually, each subsystem comprises a core region surrounded by one or more buffer regions. In protein systems, it has been shown that treating each amino acid as a core with a 4.5 Å/2.0 Å buffering scheme offers a good compromise between computational efficiency and accuracy. The D&C method is not, however, the only linear scaling SE method; other methods include density matrix minimization [115] and the localized molecular orbital method [116].

Recently, Raha and Merz reported a SE D&C-based scoring function, QMSCORE [117, 118], which is capable of predicting the binding free energy of protein-ligand complexes. QMSCORE is derived using current technologies to best describe the master equation 2–7:

\[ \Delta G_{bind} = \Delta H^{gas}_{bind} + \Delta LJ_6 + \Delta S_{solv} + \Delta S_{conf} + \Delta\Delta G_{solv} \tag{2–23} \]

The enthalpic interactions in the gas phase, $\Delta H^{gas}_{bind}$, between the protein and ligand were determined using semiempirical Hamiltonians such as AM1 and PM3. The attractive part of the Lennard-Jones potential, $\Delta LJ_6$, was used to represent the dispersive interactions neglected by SE methods. The solvent entropy, $\Delta S_{solv}$, and conformational entropy, $\Delta S_{conf}$, were accounted for by the solvent accessible surface area (SASA) and the number of rotatable bonds, respectively. The solvation free energy due to complexation, $\Delta\Delta G_{solv}$, was calculated using a Poisson-Boltzmann continuum approach. QMSCORE was applied to 165 protein-ligand complexes including HIV protease, serine protease, FKBP, and DHFR. Although there was a substantial increase in computational cost, it showed better performance than other scoring functions such as Autodock, DrugScore, and LigScore.

2.8 Pairwise Energy Decomposition (PWD)

QM methods are frequently used to determine the electronic energy of molecular systems. Electronic energies are quantities that characterize the whole system and do not provide any information regarding the key interactions taking place. Unlike a MM force field, QM does not easily lend itself to descriptions of energetics in a pairwise fashion. However, work first done by Fischer and Kollmar using a modified CNDO (Complete Neglect of Differential Overlap) method partitioned the energy into mono-, $E_A$, and bicentric, $E_{AB}$, terms [119].

\[ E_{TOT} = \sum_{A=1}^{N} E_A + \sum_{A=1}^{N} \sum_{B>A}^{N} E_{AB} \tag{2–24} \]

\[ E_{TOT} = \sum_{A} \left[ E_A + \sum_{B<A} \left( E_{AB} + E'_{AB} + E^{core}_{AB} \right) \right] \tag{2–25} \]

interactions between human Carbonic Anhydrase II, and a series of fluorine-substituted

ligands [7]. Similar to the decomposition by Fischer and Kollmar, the total energy can be calculated by summing the mono- and bicentric terms as shown in Equation 2–25. The bicentric term comprises a repulsive term, $E'_{AB}$, an exchange term, $E_{AB}$, and a core-core repulsion term, $E^{core}_{AB}$ (Eq. 2–26).

\[ E^{core}_{AB} = \frac{Z_A Z_B}{R_{AB}} \tag{2–26} \]

The presence of the $E_A$ term in Equation 2–25 means that this formalism is not fully pairwise. This term makes a large negative contribution to the total energy since it contains the one-center terms, as shown in Eq. 2–27.

\[ E_A = \frac{1}{2} \sum_{\mu}^{A} \sum_{\nu}^{A} P^{AA}_{\mu\nu} \left( 2 H^{AA}_{\mu\nu} + \sum_{\lambda\sigma}^{A} P^{AA}_{\lambda\sigma} \left[ (\mu_A \nu_A | \sigma_A \lambda_A) - \frac{1}{2} (\mu_A \sigma_A | \lambda_A \nu_A) \right] \right) \tag{2–27} \]

$E'_{AB}$, shown in Eq. 2–28, contains all the electron repulsion, and so it is a positive contributor to the energy, which comes from the diagonal block of the Fock matrix.

\[ E'_{AB} = \sum_{\mu}^{A} \sum_{\nu}^{A} \sum_{\lambda\sigma}^{B} P^{AA}_{\mu\nu} P^{BB}_{\lambda\sigma} (\mu_A \nu_A | \sigma_B \lambda_B) \tag{2–28} \]

$E_{AB}$, defined in Eq. 2–29, contains the exchange between atoms and is a small negative contributor to the total energy, which stems from the off-diagonal elements of the Fock, one-electron, and density matrices. As originally described, it contains most of the binding energy.

\[ E_{AB} = \sum_{\mu}^{A} \sum_{\nu}^{B} P^{AB}_{\mu\nu} \left( 2 H^{AB}_{\mu\nu} - \frac{1}{2} \sum_{\lambda}^{B} \sum_{\sigma}^{A} P^{BA}_{\lambda\sigma} (\mu_A \sigma_A | \lambda_B \nu_B) \right) \tag{2–29} \]

In a biological environment it is often more convenient to partition the energy function in terms of residues or fragments such as amino acids, bases, or functional groups rather than atoms. In the HCA II study by Raha the ligands were divided as shown in Fig. 2-9. The above scheme can be modified to reflect these requirements: the total energy can be broken down into intra- and inter-residue terms (Eq. 2–30), where $E^{res}_{I}$ and $E^{res}_{IJ}$ are outlined in Eq. 2–31, with $A, B \in I$ denoting that atoms A and B are members of residue I.

Figure 2-9. Schematic Diagram of the human Carbonic Anhydrase II inhibitor Fragmentation. The structure in blue is the sulfonamide moiety. The amide group is colored green while the fluoro-substituted phenyl group is shown in red.

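\[ E_{TOT} = \sum_{I} E^{res}_{I} + \sum_{I} \sum_{J<I} E^{res}_{IJ} \tag{2–30} \]

\[ E^{res}_{I} = \sum_{A} \left[ E_A + \sum_{B<A} \left( E_{AB} + E'_{AB} + E^{core}_{AB} \right) \right], \quad A, B \in I \]

\[ E^{res}_{IJ} = \sum_{A \in I} \sum_{B \in J} \left( E_{AB} + E'_{AB} + E^{core}_{AB} \right) \tag{2–31} \]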
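The regrouping of atomic mono- and bicentric terms into intra- and inter-residue energies described above amounts to a simple aggregation. The sketch below assumes the atomic terms have already been computed; the function name, the data layout (a list of monocentric energies and a dictionary of summed bicentric energies keyed by atom pairs), and the numbers are hypothetical.

```python
def residue_decomposition(mono, pair, residue_of):
    """Regroup atomic energy terms by residue.

    mono[a]       : monocentric energy E_A of atom a
    pair[(a, b)]  : summed bicentric energy (E_AB + E'_AB + E_AB^core), one entry per atom pair
    residue_of[a] : residue label of atom a
    Returns (intra, inter): intra-residue energies keyed by residue and
    inter-residue energies keyed by a sorted residue pair.
    """
    intra = {}
    inter = {}
    for atom, energy in enumerate(mono):
        label = residue_of[atom]
        intra[label] = intra.get(label, 0.0) + energy
    for (a, b), energy in pair.items():
        res_a, res_b = residue_of[a], residue_of[b]
        if res_a == res_b:
            intra[res_a] += energy          # both atoms in the same residue
        else:
            key = tuple(sorted((res_a, res_b)))
            inter[key] = inter.get(key, 0.0) + energy
    return intra, inter


if __name__ == "__main__":
    # Hypothetical four-atom system split into two residues.
    mono = [-10.0, -12.0, -8.0, -9.0]
    residue_of = ["I", "I", "J", "J"]
    pair = {(1, 0): -1.5, (2, 0): -0.2, (2, 1): 0.1,
            (3, 0): -0.05, (3, 1): 0.0, (3, 2): -2.0}
    intra, inter = residue_decomposition(mono, pair, residue_of)
    print(intra, inter)
```

Because every atomic term is assigned to exactly one intra- or inter-residue bucket, the residue-level energies sum back to the same total energy, which is a useful sanity check.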

The point charges used in classical methods are derived from quantum mechanics using various approaches [122]. The Mulliken charges are the simplest (Eq. 2–32) where P and S are the density and overlap matrices.

q = Z (PS) (2–32) i i − νν ν∈i X The Mulliken charges are rarely used due to systematic errors and failure to reproduce experiment properties. Cramer and Truhlar [123–125] have parameterized methods

called Charge Model 1 (CM1) and Charge Model 2 (CM2) which are based on Mulliken charges but include corrections for errors due to sensitivity to one-electron basis sets and levels of theory used. The CM1 functional form is shown in Eq. 2–33 where f CM1 is a

46 parameterized function and BAC is the bond order between atoms A and C.

CM1 Mulliken CM1 qi = qi + f (BAC ) (2–33) AX6=C The atomic point charges used in the AMBER FF are derived using the Merz-Singh- Kollman (MK) [126] and Restrained ElectroStatic Potential (RESP) [127–129] schemes. The MK charges are generated from QM to reproduce the electrostatic potential at points

around the molecule. The fitting procedure begins by giving each atom a van der Waals radius and then forming a grid encompassing the molecule. Charges are then generated using grid points not within the van der Waals volume. The MK procedure can often place asymmetrical or large charges on atoms, which leads to problems when simulating biomolecular interactions. The RESP scheme solves these problems.

2.10 Comparative Binding Energy Analysis (COMBINE)

COMBINE was first developed by Wade and Ortiz in 1995, when they modeled the binding free energies of a series of ligands to a receptor using an MM potential and PLS [8, 130]. The COMBINE method has been successfully applied to protein-ligand [131–134], RNA-ligand [135], protein-DNA [136], and protein-peptide [137] complexes, for which predictive QSAR models have been generated. COMBINE was also used in a flexible virtual screening application of Factor Xa inhibitors by Murcia [138].

The enthalpic interaction between the receptor and the ligand is approximated using an MM function, and the solvation interaction using Poisson-Boltzmann or Generalized-Born methods. The Hamiltonian, ∆U, in MM methods for the binding of a ligand to a receptor can be described by Eq. 2–34, where $u^{VDW}_{ij}$ and $u^{ELE}_{ij}$ are the van der Waals and electrostatic interactions between atom i of the ligand and atom j of the receptor. The $\Delta u^{B,A,T}_{i}$ are the energy changes associated with bond lengths, angles and dihedrals, and $\Delta u^{NB}_{ii'}$ are the changes in the intramolecular non-bonded interactions upon binding. The premise of the COMBINE approach is that ∆U can be approximated by a weighted linear combination of the most important energetic interactions, $u_i$, between the ligand and the receptor, as shown in Eq. 2–35. It has been shown using experimental approaches such as thermodynamic double mutant cycles and pH titration that only a small number of sites of the receptor play a role in binding. On this basis, the approximation central to COMBINE can be considered reasonable.

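$$\Delta U = \sum_{i=1}^{n_l}\sum_{j=1}^{n_r} u^{VDW}_{ij} + \sum_{i=1}^{n_l}\sum_{j=1}^{n_r} u^{ELE}_{ij} + \sum_{i=1}^{n_l}\left(\Delta u^{B,L}_{i} + \Delta u^{A,L}_{i} + \Delta u^{T,L}_{i}\right) + \sum_{i<i'} \Delta u^{NB,L}_{ii'} + \cdots \tag{2–34}$$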

and the coefficients, $w_i$. This is accomplished using variable selection techniques and multivariate statistics or a genetic algorithm [139]. Variable selection methods such as D-optimal selection and fractional factorial design have been used to reduce ∆U to the so-called effective potential function, U′. The ∆G of binding can then be approximated using Eq. 2–36, where the coefficients and the regression constant, C, are evaluated using multivariate statistical methods. The constant, C, contains information common to all compounds in the series of ligands plus the terms that are neglected in the equation, such as entropy.

$$\Delta G = \sum_{i=1}^{n} w_i \Delta u_i + C \tag{2–36}$$

2.11 SemiEmpirical Comparative Binding Energy Analysis (SE-COMBINE)

The SemiEmpirical Comparative Binding Energy Analysis (SE-COMBINE) [9, 140] approach was developed as a direct extension of the COMBINE and PWD approaches previously described. The interaction energy in SE methods was decomposed, resulting in Eq. 2–37, with the descriptor table shown in Fig. 2-10.

The SE-COMBINE method was used to elucidate the most important interactions between trypsin and a series of inhibitors. Protein-ligand interaction energies are decomposed to find the most or least stabilizing interactions and to identify regions of significant variation (thereby targeting areas that could benefit from more optimization). The multivariate statistical tools, PCA and PLS, were used to mine the interactions between the receptor residues and the ligand fragments to generate QSAR models. The fragmentation scheme used in this study is shown in Fig. 2-11, where each ligand contains the 3-amidino-phenylalanine moiety. The authors introduced so-called IMMs (InterMolecular interaction Maps), an example of which is given in Fig. 2-12, which enable the researcher to graphically view where a candidate drug could be modified or optimized.

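$$E_{INT} = \sum_{I}\sum_{J}\sum_{A \in I}\sum_{B \in J}\left(E_{AB} + E'_{AB} + E^{core}_{AB}\right) + \sum_{I}\sum_{K}\sum_{A \in I}\sum_{B \in K}\left(\Delta E_{AB} + \Delta E'_{AB} + \Delta E^{core}_{AB}\right) + \cdots \tag{2–37}$$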

[Table layout: one row per molecule (1, 2, ..., N); columns for the molecule (Mol), its activity (Act), and the IJ, IK, JL, I, and J blocks of the $E'_{AB}$, $E_{AB}$, and $E^{core}_{AB}$ descriptors.]

Figure 2-10. SE-COMBINE Descriptor Table.

2.12 Graph Theory

Many problems in cheminformatics, such as finding the shortest path from one atom to another and ring and substructure searching, are solved using graph theory and recursive algorithms [40]. Graph theory is a well-established area and is commonly used in computer networking [141].

Figure 2-11. Schematic Diagram of a Trypsin Inhibitor Fragmentation. The structure in blue is the 3-amidino-phenylalanine moiety (APM). The TOS group is colored green while the PIP group is shown in red.

A graph G consists of a set of n vertices, V, and a set of m edges, E, where an edge is an unordered pair of vertices: $V = \{v_1, v_2, v_3, v_4, v_5, \ldots, v_n\}$, $E = \{e_{12}, e_{23}, e_{34}, e_{37}, \ldots, e_m\}$, and $G = \{V, E\}$. The order and size of a graph are the number of vertices, n, and the number of edges, m, respectively. The degree of a vertex v of G is the number of edges incident upon v. Connected graphs contain a route from every vertex to every other. Multigraphs (molecules containing multiple bonds) are graphs which contain repeated edges between vertices, while a simple graph does not contain any. A directed graph, or digraph, is a graph with directions assigned to each edge. Complete graphs are denoted by $K_n$ and are graphs where an edge connects every pair of vertices. A labeled graph is one where the vertices and/or edges are given labels. A weighted graph is a type of labeled graph where the labels are real numbers.

A walk in G is a sequence of vertices $w = [v_1, v_2, \ldots, v_k]$, $k \geq 1$, such that $[v_j, v_{j+1}] \in E$ for $j = 1, \ldots, k-1$. The walk is closed if $k > 1$ and $v_k = v_1$, and open if they are different. A walk is called a path if there are no repeated vertices. A closed walk with no repeated vertices other than its first and last one is called a cycle. The length l of a walk is the number of edges it contains (open walks: $l = s - 1$, closed walks: $l = s$, where s is the number of vertices visited). The terms path and chain describe an open walk and a walk

[Heat map: Descriptor (trypsin residue and ligand fragment pairs) on the x-axis, Compound on the y-axis, with a color legend running from 0.0 to −1.0.]

Figure 2-12. Model Lig3C Intermolecular Interaction Map (IMM) of the important $E_{AB}$ descriptors. The key residues of trypsin that interact with the triple fragment ligand (APM, TOS, and PIP; see Fig. 2-11) label the x-axis. The compounds on the y-axis are ordered with respect to activity, which decreases from top to bottom. The legend indicates the magnitude of the unscaled descriptor in eV.

(a) Sample Graph (b) Molecular Graph

Figure 2-13. Graphs to Link the Terminology used in Graph Theory and Chemistry.

in which all vertices (and edges) are distinct, respectively. Cycles and paths of size n are denoted by $C_n$ and $P_n$, respectively. A block is a group of vertices such that all edges between them are involved in one or more cycles. An open acyclic vertex is a vertex that is not located between two blocks, while a closed acyclic vertex is located between two blocks.

The graph in Fig. 2-13(a) contains a cycle, $R = \{R_V, R_E\}$, where $R_V = \{v_7, v_8, v_9\}$ and $R_E = \{e_{78}, e_{89}, e_{97}\}$, of type $C_3$. R is a subgraph of G, where the vertices and edges of R are subsets of those of G; in other words, R is isomorphic to a subgraph of G. Conversely, G is a supergraph of R. Determining whether a graph $G_1$ is isomorphic to a subgraph of $G_2$ is known as the subgraph isomorphism problem, which is NP-complete (Non-deterministic Polynomial time). The term clique is used for a set of vertices where an edge exists between each pair. A clique is a subgraph of G and is itself a complete graph. A k-clique is a clique of order k. Clique detection, or maximum common subgraph isomorphism, is a method to find the largest subgraph of $G_1$ isomorphic to a subgraph of $G_2$. The subgraph isomorphism and maximum common subgraph isomorphism problems are known in cheminformatics as substructure searching and pharmacophore mapping, respectively.

A molecule can be represented as a graph where the atoms are the vertices and the bonds are the edges, as in Fig. 2-13(b). This is a labeled or colored graph; in other words, each vertex is labeled with the element type and each edge is colored with the bond order. The similarity between the two structures (graph and molecule) can be seen in Figs. 2-13(a) and 2-13(b). A dictionary of terms is compiled in Table 2-1.

Table 2-1. Correspondence between Graph Theory and Chemical Terminology.

Graph Theory                              Chemistry
Connected Graph                           Molecule
Graph Order                               Number of Atoms
Graph Size                                Number of Bonds
Vertex Degree                             Number of Bonded Atoms
Leaf Vertex                               Terminal Atom
Closed Path/Cycle                         Ring
Cycle Type                                Ring Size
Chain                                     Chain
Block                                     Cluster of Rings
Subgraph Isomorphism Problem              Substructure Searching
Maximum Common Subgraph Isomorphism       Pharmacophore Mapping
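The correspondence in Table 2-1 can be made concrete with a small adjacency-list graph. The sketch below is illustrative only: the `Graph` class and its member names are assumptions chosen for this example and do not reproduce the MTK++ Graph library.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A simple undirected graph stored as an adjacency list.
// Vertices play the role of atoms and edges the role of bonds.
class Graph {
 public:
  explicit Graph(std::size_t order) : adj_(order), size_(0) {}

  // Add an undirected edge (bond) between vertices u and v.
  void addEdge(std::size_t u, std::size_t v) {
    assert(u < adj_.size() && v < adj_.size() && u != v);
    adj_[u].push_back(v);
    adj_[v].push_back(u);
    ++size_;
  }

  std::size_t order() const { return adj_.size(); }  // number of atoms
  std::size_t size() const { return size_; }         // number of bonds
  std::size_t degree(std::size_t v) const {          // number of bonded atoms
    return adj_[v].size();
  }
  bool isLeaf(std::size_t v) const { return degree(v) == 1; }  // terminal atom

 private:
  std::vector<std::vector<std::size_t>> adj_;
  std::size_t size_;
};

// Heavy-atom graph of methylcyclopropane: a C3 cycle (0-1-2) plus a
// methyl carbon (3) attached to vertex 0.
Graph methylcyclopropane() {
  Graph g(4);
  g.addEdge(0, 1);
  g.addEdge(1, 2);
  g.addEdge(2, 0);
  g.addEdge(0, 3);
  return g;
}
```

In this example the graph order (4) is the number of heavy atoms, the graph size (4) is the number of bonds, and vertex 3 is a leaf vertex, i.e. a terminal atom.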

A tree, $T = (T_V, T_E)$, is a connected acyclic graph. Trees contain leaves, which are vertices of degree 1, and non-leaf vertices. A root is a vertex where all edges point away from it. A forest is a set of disjoint trees, while a “k-ary tree is a rooted tree in which every vertex has k children”. Trees are often used in conformational searching and other combinatorial problems. A matching or edge independent set, M, of G is a subset of the edges such that no two edges in M share a vertex. There are three types of matching called maximum, maximal, and perfect. A maximum matching is a matching of highest cardinality. A maximal matching is a matching to which no other edges can be added, while a perfect matching covers all vertices of the graph. A matching is maximum if and only if it has no augmenting path. An augmenting path is an alternating path which starts and ends with free or unmatched vertices. An alternating path is a path whose edges are alternately in M and not in M. For molecular graphs the maximum weighted matching algorithm is a technique for assigning double and triple bonds and corresponds to maximizing the number of double bonds in a pi system [142].
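The difference between a maximal and a maximum matching can be illustrated with a greedy scan over the edges. This is a minimal sketch with hypothetical function names; it is not the maximum weighted matching algorithm of [142], only a way to see that a greedy result is always maximal but not necessarily maximum.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using Edge = std::pair<std::size_t, std::size_t>;

// Greedy maximal matching: scan the edges in the given order and keep an
// edge whenever neither of its endpoints is already matched.
std::vector<Edge> greedyMaximalMatching(std::size_t order,
                                        const std::vector<Edge>& edges) {
  std::vector<bool> matched(order, false);
  std::vector<Edge> matching;
  for (const Edge& e : edges) {
    assert(e.first < order && e.second < order);
    if (!matched[e.first] && !matched[e.second]) {
      matched[e.first] = true;
      matched[e.second] = true;
      matching.push_back(e);
    }
  }
  return matching;
}

// A matching is perfect when every vertex is covered by one of its edges.
bool isPerfect(std::size_t order, const std::vector<Edge>& matching) {
  return 2 * matching.size() == order;
}
```

For the six-membered ring of benzene with edges listed in ring order, the greedy scan returns three edges, a perfect matching that corresponds to one Kekulé structure. For a four-vertex chain scanned starting with its central edge, the result is a single edge: maximal, since nothing can be added, but not maximum, since the two outer edges form a larger matching.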

There are various ways of traversing or searching a graph. One such technique is the depth-first search. This is implemented as a recursive routine that tracks which vertices and edges have been encountered, thus visiting each only once.
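A recursive depth-first search of the kind described above can be sketched as follows. The adjacency-list type and function names are assumptions for this illustration; the sketch marks vertices on first encounter and, as a simple application, counts the connected components (molecules) in a graph.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using AdjacencyList = std::vector<std::vector<std::size_t>>;

// Recursive depth-first search from vertex v. Each vertex is marked on first
// encounter so that it is visited only once; the visiting order is recorded.
void depthFirstSearch(const AdjacencyList& adj, std::size_t v,
                      std::vector<bool>& visited,
                      std::vector<std::size_t>& order) {
  assert(v < adj.size());
  visited[v] = true;
  order.push_back(v);
  for (std::size_t w : adj[v]) {
    if (!visited[w]) depthFirstSearch(adj, w, visited, order);
  }
}

// Count connected components by starting a new search from every vertex
// that has not yet been reached.
std::size_t countComponents(const AdjacencyList& adj) {
  std::vector<bool> visited(adj.size(), false);
  std::vector<std::size_t> order;
  std::size_t components = 0;
  for (std::size_t v = 0; v < adj.size(); ++v) {
    if (!visited[v]) {
      depthFirstSearch(adj, v, visited, order);
      ++components;
    }
  }
  return components;
}
```

Because every vertex is marked before its neighbors are explored, ring closures do not cause infinite recursion: an edge leading back to a marked vertex is simply skipped.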

(a) Molecular Graph (b) Graph (c) Maximum Matching (d) Maximal Matching (e) Kekulé Structures of Benzene

Figure 2-14. Graphs to Link the Terminology used in Graph Theory and Chemistry.

2.13 Statistical Methods

All RBDD and LBDD methods use statistical methods to correlate with or predict binding free energies. Common statistical measures or tools include the mean (Eq. 2–38), centering (Eq. 2–39), sample variance (Eq. 2–40), standard deviation ($\sqrt{\sigma^2}$), Z-score (Eq. 2–41), and covariance (Eq. 2–42), where $E_i$ is the observable quantity for element i and N is the total number of observables.

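$$\langle E \rangle = \frac{1}{N}\sum_{i=1}^{N} E_i \tag{2–38}$$

$$E_i = E_i - \langle E \rangle \tag{2–39}$$

$$\sigma^2 = \frac{1}{N-1}\sum_{i}\left(E_i - \langle E \rangle\right)^2 \tag{2–40}$$

$$Z_{score} = \frac{E_i - \langle E \rangle}{\sigma} \tag{2–41}$$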
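Eqs. 2–38 to 2–41 translate directly into code. The following is a minimal sketch with hypothetical function names; it assumes the sample (N − 1) form of the variance shown in Eq. 2–40.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Mean of the observables, Eq. 2-38.
double mean(const std::vector<double>& e) {
  assert(!e.empty());
  double sum = 0.0;
  for (double x : e) sum += x;
  return sum / static_cast<double>(e.size());
}

// Mean-centered copy of the data, Eq. 2-39.
std::vector<double> center(const std::vector<double>& e) {
  const double m = mean(e);
  std::vector<double> c;
  c.reserve(e.size());
  for (double x : e) c.push_back(x - m);
  return c;
}

// Sample variance with the N - 1 denominator, Eq. 2-40.
double sampleVariance(const std::vector<double>& e) {
  assert(e.size() > 1);
  double ss = 0.0;
  for (double d : center(e)) ss += d * d;
  return ss / static_cast<double>(e.size() - 1);
}

// Z-scores, Eq. 2-41: centered values divided by the standard deviation.
std::vector<double> zScores(const std::vector<double>& e) {
  const double sigma = std::sqrt(sampleVariance(e));
  assert(sigma > 0.0);  // constant data cannot be scaled
  std::vector<double> z = center(e);
  for (double& v : z) v /= sigma;
  return z;
}
```

For the data set {1, 2, 3} the mean is 2, the sample variance is 1, and the Z-scores are −1, 0, and 1.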

$$Cov = \frac{1}{N-1}\sum_{i}\left(E_i - \langle E \rangle\right)\left(E'_i - \langle E' \rangle\right) \tag{2–42}$$

MLR is an extension of the ordinary least squares method where more than one independent variable is used to derive the QSAR model, which takes the following form:

Y = BX + E (2–43)

where B is the matrix of regression coefficients and E are the residuals. The quality of the fit can be assessed using the Pearson correlation coefficient, r, as shown in Eq. 2–44.

$$r = \frac{\sum_{i=1}^{N}\left(x_i - \langle x \rangle\right)\left(y_i - \langle y \rangle\right)}{\sqrt{\left[\sum_{i=1}^{N}\left(x_i - \langle x \rangle\right)^2\right]\left[\sum_{i=1}^{N}\left(y_i - \langle y \rangle\right)^2\right]}} \tag{2–44}$$

The square of the Pearson correlation coefficient, $r^2$ or $R^2$, is often reported and describes the goodness of fit. Cross-validation or jack-knifing is a technique that checks the quality of the fit. It removes some of the observations from the data set and derives a model with the remainder. It then predicts the values of the data that have been left out. The PRESS value (Eq. 2–45) is the residual sum of squares of the data left out and is used in the calculation of $q^2$ or $Q^2$ (Eq. 2–46), the cross-validated $R^2$, which is the measure of predictability. Many different forms of cross-validation can be used, but the most common is the ‘Leave-One-Out’ or LOO scheme, where each observation is left out and predicted in turn.

$$PRESS = \sum_{i=1}^{N}\left(y_{pred,i} - y_i\right)^2 \tag{2–45}$$

$$Q^2 = 1 - \frac{PRESS}{\sum_{i=1}^{N}\left(y_i - \langle y \rangle\right)^2} \tag{2–46}$$

Together with the correlation coefficient, $R^2$, and the cross-validated correlation coefficient, $Q^2$, the standard deviation of error of calculations, SDEC, and the standard deviation of error of prediction, SDEP, are used to assess the quality of the model. SDEP can also be defined as the root mean squared error of the dependent variables in a LOO scheme or external data set, as shown in Eq. 2–47. Similarly, SDEC is calculated for those variables used to build the model, i.e. the training set.
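The Leave-One-Out procedure can be sketched for the simplest case of a one-descriptor linear model, y = a + bx. The code below is illustrative: the struct and function names are hypothetical, and it assumes the common definition of $Q^2$ in which PRESS is divided by the sum of squared deviations of the observed values from their mean, with SDEP taken as the square root of PRESS/N.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct LooResult {
  double press;  // predictive residual sum of squares
  double q2;     // cross-validated R^2
  double sdep;   // sqrt(PRESS / N)
};

// Ordinary least squares fit of y = a + b*x using every point except `skip`.
static void fitLineWithout(const std::vector<double>& x,
                           const std::vector<double>& y, std::size_t skip,
                           double& a, double& b) {
  double n = 0.0, sx = 0.0, sy = 0.0, sxx = 0.0, sxy = 0.0;
  for (std::size_t i = 0; i < x.size(); ++i) {
    if (i == skip) continue;
    n += 1.0;
    sx += x[i];
    sy += y[i];
    sxx += x[i] * x[i];
    sxy += x[i] * y[i];
  }
  const double denom = n * sxx - sx * sx;
  assert(denom != 0.0);  // the remaining x values must not all be equal
  b = (n * sxy - sx * sy) / denom;
  a = (sy - b * sx) / n;
}

// Leave each observation out in turn, refit, and predict the omitted value.
LooResult leaveOneOut(const std::vector<double>& x,
                      const std::vector<double>& y) {
  const std::size_t N = x.size();
  assert(N == y.size() && N > 2);
  double ybar = 0.0;
  for (double v : y) ybar += v;
  ybar /= static_cast<double>(N);

  double press = 0.0, total = 0.0;
  for (std::size_t i = 0; i < N; ++i) {
    double a = 0.0, b = 0.0;
    fitLineWithout(x, y, i, a, b);
    const double predicted = a + b * x[i];
    press += (predicted - y[i]) * (predicted - y[i]);
    total += (y[i] - ybar) * (y[i] - ybar);
  }
  assert(total > 0.0);
  return {press, 1.0 - press / total,
          std::sqrt(press / static_cast<double>(N))};
}
```

Perfectly linear data give PRESS = 0 and $Q^2$ = 1, whereas data that a line cannot predict can give a negative $Q^2$, which signals a model with no predictive value.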

$$SDEP = \sqrt{\frac{PRESS}{N}} \tag{2–47}$$

Unsigned or absolute error (Eq. 2–48), signed error (Eq. 2–49), mean squared error (Eq. 2–50) and Root Mean Squared Deviation (RMSD, Eq. 2–51) are also often calculated to measure the quality of the models.

$$ae = \frac{\sum_{i}\left|E_i - E'_i\right|}{N} \tag{2–48}$$

$$se = \frac{\sum_{i}\left(E_i - E'_i\right)}{N} \tag{2–49}$$

$$MSE = \frac{\sum_{i}\left(E_i - E'_i\right)^2}{N} \tag{2–50}$$

$$RMSD = \sqrt{\frac{\sum_{i}\left(E_i - E'_i\right)^2}{N}} \tag{2–51}$$

PCA is a method to reduce the dimensionality of the descriptor space by generating linear combinations of the original descriptors, called Principal Components (PCs), that best describe their variance. Usually the number of PCs is smaller than the number of descriptors, and in doing so PCA reveals the “underlying factors or combinations of the original variables that principally determine the structure of the data distribution” [143]. Generally in PCA the X matrix is “autoscaled” (Eq. 2–41); in other words, each descriptor is processed to have a mean of zero and a standard deviation of 1, ensuring that certain variables do not dominate because of their magnitude. The theory of PCA stems from the eigenanalysis of the correlation matrix, C (Eq. 2–52), where X is the descriptor matrix. The descriptor matrix has dimensions of $n \times k$, where n is the number of observations and k is the number of measured variables. C is a square, symmetric matrix, which facilitates the generation of principal components or eigenvectors, P, that are orthonormal to each other. A schematic diagram of the matrices and vectors involved in PCA is shown in Fig. 2-15.

$$C = X^{T}X \tag{2–52}$$

$$CP = \lambda_i P \tag{2–53}$$

The magnitude of each eigenvalue, $\lambda_i$, represents the variance of the X matrix that the PC explains.

$$PC_i = P_{i1}X_1 + P_{i2}X_2 + \cdots + P_{ik}X_k \tag{2–54}$$

PCA can also be derived in terms of the original descriptor matrix, X, as shown in Eq. 2–55, where $p_i$ are the loading vectors or eigenvectors, $t_i$ are the score vectors, and E is the error term that remains after the descriptor matrix is reduced to q principal components.

$$X = t_1 p_1^{T} + t_2 p_2^{T} + \cdots + t_q p_q^{T} + E \tag{2–55}$$

$$t_i = Xp_i \tag{2–56}$$
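The steps of Eqs. 2–52 to 2–56 can be illustrated for the first principal component. This is a minimal sketch under stated assumptions: the names are hypothetical, and power iteration is used here as a simple stand-in for a full eigensolver (a production code would more likely call a LAPACK routine).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;  // X[observation][descriptor]

// Autoscale every column to zero mean and unit sample standard deviation.
Matrix autoscale(const Matrix& X) {
  const std::size_t n = X.size();
  assert(n > 1);
  const std::size_t k = X[0].size();
  Matrix Z(n, std::vector<double>(k, 0.0));
  for (std::size_t j = 0; j < k; ++j) {
    double mean = 0.0, var = 0.0;
    for (std::size_t i = 0; i < n; ++i) mean += X[i][j];
    mean /= static_cast<double>(n);
    for (std::size_t i = 0; i < n; ++i) var += (X[i][j] - mean) * (X[i][j] - mean);
    var /= static_cast<double>(n - 1);
    assert(var > 0.0);  // constant descriptors cannot be autoscaled
    for (std::size_t i = 0; i < n; ++i) Z[i][j] = (X[i][j] - mean) / std::sqrt(var);
  }
  return Z;
}

struct PrincipalComponent {
  std::vector<double> loading;  // p, unit-length eigenvector of C
  std::vector<double> scores;   // t = Z p
  double eigenvalue;            // lambda = p^T C p
};

// First principal component of an autoscaled matrix Z, obtained by power
// iteration on C = Z^T Z.
PrincipalComponent firstComponent(const Matrix& Z, int maxIter = 500,
                                  double tol = 1e-12) {
  const std::size_t n = Z.size(), k = Z[0].size();
  Matrix C(k, std::vector<double>(k, 0.0));
  for (std::size_t a = 0; a < k; ++a)
    for (std::size_t b = 0; b < k; ++b)
      for (std::size_t i = 0; i < n; ++i) C[a][b] += Z[i][a] * Z[i][b];

  // Arbitrary start vector; it must not be orthogonal to the leading eigenvector.
  std::vector<double> p(k);
  for (std::size_t j = 0; j < k; ++j) p[j] = 1.0 / static_cast<double>(j + 1);

  for (int iter = 0; iter < maxIter; ++iter) {
    std::vector<double> v(k, 0.0);
    double norm = 0.0, change = 0.0;
    for (std::size_t a = 0; a < k; ++a) {
      for (std::size_t b = 0; b < k; ++b) v[a] += C[a][b] * p[b];
      norm += v[a] * v[a];
    }
    norm = std::sqrt(norm);
    assert(norm > 0.0);
    for (std::size_t a = 0; a < k; ++a) {
      v[a] /= norm;
      change += std::fabs(v[a] - p[a]);
    }
    p = v;
    if (change < tol) break;
  }

  PrincipalComponent pc{p, std::vector<double>(n, 0.0), 0.0};
  for (std::size_t a = 0; a < k; ++a)
    for (std::size_t b = 0; b < k; ++b) pc.eigenvalue += p[a] * C[a][b] * p[b];
  for (std::size_t i = 0; i < n; ++i)
    for (std::size_t j = 0; j < k; ++j) pc.scores[i] += Z[i][j] * p[j];
  return pc;
}
```

For two perfectly correlated descriptors the loading vector has equal weights on both, and a single component carries all of the variance.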

The score vectors, $t_i$, determined from the X matrix and $p_i$, are the new variables of the reduced data set and describe how the different dependent variables relate to one another, while the loadings, $p_i$, reveal which descriptors are responsible. It is common to analyze the results of PCA using scatter plots of the scores and loadings. The similarity or dissimilarity between dependent variables is investigated using score plots. Usually, clustering of the data occurs, with very similar data points grouping together and dissimilar ones lying further apart. The loadings plot is used in conjunction with the score plot to determine why such clustering exists and to decipher which of the original X variables are causing it.

[Schematic: the N × K descriptor matrix X (compounds × descriptors) with the score matrix T and the loading matrix P′.]

Figure 2-15. Principal Component Analysis (PCA) Schematic Diagram of the Matrices and Vectors Involved.

Partial Least Squares (PLS) is a technique similar to PCR; however, it is derived in such a way that the X scores, $t_i$, explain the variation in X and correlate with Y simultaneously. PLSR transforms the X matrix into orthogonal components, so-called latent variables, and then performs a regression step that predicts Y. NIPALS, SIMPLS, and the kernel method are algorithms for calculating a PLS model; the basic algorithm was developed by Wold et al. and is outlined below, and the matrices and vectors are shown in Fig. 2-16 [14, 144]. The algorithm begins by setting the vector u to one of the columns of Y, which allows the calculation of the X-weights, w:

$$w = \frac{X^{T}u}{u^{T}u} \tag{2–57}$$

The weights are normalized ($w^{T}w = 1$):

$$w = \frac{w}{\|w\|} \tag{2–58}$$

From the normalized weights, the X-scores, t, are calculated:

$$t = \frac{Xw}{w^{T}w} \tag{2–59}$$

From the calculated X-scores the Y-weights, c, and Y-scores, u, are determined:

$$c^{T} = \frac{t^{T}Y}{t^{T}t} \tag{2–60}$$

$$u = \frac{Yc}{c^{T}c} \tag{2–61}$$

This method is an iterative process, which is tested on the change in t (Eq. 2–62). The convergence criterion, $\epsilon$, is normally set in the range of $10^{-6}$ to $10^{-8}$. If convergence has not been reached the algorithm returns to the calculation of the X-weights. Note that if there is only a single Y variable then the algorithm converges in a single step.

$$\frac{\|t_{old} - t_{new}\|}{\|t_{new}\|} < \epsilon \tag{2–62}$$

Once convergence is reached the X- and Y-loadings, p and q, are calculated:

$$p = \frac{X^{T}t}{t^{T}t} \tag{2–63}$$

$$q = \frac{Y^{T}u}{u^{T}u} \tag{2–64}$$

To proceed to the next latent variable the X matrix (Y is optional) must be “deflated”; in other words, the current latent variable’s information must be removed, as shown in Eqs. 2–65 and 2–66. The total number of latent variables to consider is generally determined using cross-validation.

$$X = X - tp^{T} \tag{2–65}$$

$$Y = Y - tc^{T} \tag{2–66}$$

Similar to MLR (Y = BX + E), the full PLSR solution can be written as Eq. 2–67, where B is the regression coefficient matrix.

$$B = W(P^{T}W)^{-1}C^{T} \tag{2–67}$$
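For a single response variable the algorithm above converges in one pass, which makes it easy to sketch. The code below extracts one latent variable following Eqs. 2–57 to 2–63 and forms the regression vector of Eq. 2–67 for that component. The names are hypothetical, X and y are assumed to be mean-centered beforehand, and deflation for further components is left out.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vector = std::vector<double>;
using Matrix = std::vector<Vector>;  // X[observation][descriptor]

static double dot(const Vector& a, const Vector& b) {
  assert(a.size() == b.size());
  double s = 0.0;
  for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
  return s;
}

struct LatentVariable {
  Vector w;  // normalized X-weights
  Vector t;  // X-scores
  Vector p;  // X-loadings
  double c;  // Y-weight (a scalar because there is a single response)
  Vector b;  // regression coefficients, b = w (p^T w)^-1 c
};

// One latent variable of PLS for a single, mean-centered response y.
LatentVariable plsOneComponent(const Matrix& X, const Vector& y) {
  const std::size_t n = X.size();
  assert(n > 0 && y.size() == n);
  const std::size_t k = X[0].size();
  LatentVariable lv{Vector(k, 0.0), Vector(n, 0.0), Vector(k, 0.0), 0.0,
                    Vector(k, 0.0)};

  // w = X^T u / (u^T u) with u = y, then normalize to unit length.
  const double uu = dot(y, y);
  assert(uu > 0.0);
  for (std::size_t j = 0; j < k; ++j) {
    for (std::size_t i = 0; i < n; ++i) lv.w[j] += X[i][j] * y[i];
    lv.w[j] /= uu;
  }
  const double wnorm = std::sqrt(dot(lv.w, lv.w));
  assert(wnorm > 0.0);
  for (double& v : lv.w) v /= wnorm;

  // t = X w / (w^T w); the denominator is 1 after normalization.
  for (std::size_t i = 0; i < n; ++i) lv.t[i] = dot(X[i], lv.w);

  // c = t^T y / (t^T t) and p = X^T t / (t^T t).
  const double tt = dot(lv.t, lv.t);
  assert(tt > 0.0);
  lv.c = dot(lv.t, y) / tt;
  for (std::size_t j = 0; j < k; ++j) {
    for (std::size_t i = 0; i < n; ++i) lv.p[j] += X[i][j] * lv.t[i];
    lv.p[j] /= tt;
  }

  // Regression vector for one component: b = w (p^T w)^-1 c.
  const double pw = dot(lv.p, lv.w);
  assert(pw != 0.0);
  for (std::size_t j = 0; j < k; ++j) lv.b[j] = lv.w[j] * lv.c / pw;
  return lv;
}
```

With two identical centered descriptors and y equal to their sum, the sketch returns equal weights and a regression vector of (1, 1), so the training responses are reproduced exactly.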

The kernel method is an alternative to the above algorithm. Similar to PCA, it uses eigenvalue-eigenvector equations to arrive at the same result. For example, the X-weights are determined by taking the first eigenvector of the variance-covariance matrix $X^{T}YY^{T}X$. Similar to the analysis of a PCA model, the interpretation of a PLSR model involves score and loading plots; however, in PLSR these plots involve the X and Y scores and weights.

[Schematic: the descriptor matrix X (N compounds × K descriptors) and the response matrix Y, with the score matrices T and U, the loading matrices P′ and C′, and the weight matrix W′.]

Figure 2-16. Partial Least Squares (PLS) Schematic Diagram of the Matrices and Vectors Involved.

2.14 Metalloproteins

Metalloproteins are a key subset of proteins in the body which bind a transition metal. The metal ion acts as a Lewis acid (electron pair acceptor) towards amino acids or other molecules which are Lewis bases (possess one or more lone pairs) and are called ligands. The bonding of ligands to a metal ion is described as dative when the ligand donates one or more lone pairs to the metal. Metals can be coordinated to any number of ligands, with four, five and six being the most common in biosystems. Thus the most likely geometries include tetrahedral, square planar, trigonal bipyramidal, and octahedral. The amino acids that most commonly bind to a metal ion in metalloproteins are shown in Fig. 2-17. The side chain of each amino acid is labeled with Greek letters, and these are used when referring to the atom of an amino acid that bonds to the metal; e.g. Zn-CYS@SG indicates that the gamma sulfur of a cysteine residue is bound to a zinc ion. Iron, Copper, and Zinc are the most abundant transition metals in the human body. Metal ions in biological systems have both structural and functional roles. They are termed structural when no chemical reaction takes place at the metal site but the metal aids in the stabilization of the protein structure, whereas functional metalloproteins carry out chemical reactions.

(a) Cysteine (CYS) (b) Methionine (MET) (c) Aspartic Acid (ASP) (d) Glutamic Acid (GLU) (e) Histidine (HIS)

Figure 2-17. Most Common Amino Acid Residues which Bond to Metal Ions.

Zinc proteins are both structural and catalytic. Zinc acts as a superacid and promotes the hydrolysis or cleavage of chemical bonds. For example, Human Carbonic Anhydrase II (HCA II) (Fig. 2-18(a)) catalyzes the interconversion of CO2 and bicarbonate. HCA II contains a tetrahedral zinc at its active site, as shown in Fig. 2-18(b). The Zinc atom is bound to three histidine residues and a water molecule (pH < 7) or a hydroxide ion. Farnesyl Transferase (FTase) is a zinc metalloenzyme that removes the diphosphate group from the farnesyl diphosphate substrate and connects the resulting farnesyl moiety to a cysteine residue of the target protein. The full protein structure and active site of 1QBQ are shown in Figs. 2-18(c) and 2-18(d). Other Zn metalloproteins include carboxypeptidase, which cleaves the C-terminal residue from peptides, and alcohol dehydrogenase, which converts ethanol to acetaldehyde.

Metalloproteins that contain Copper are also both structural and functional. Copper can change oxidation state and is often involved in electron-transfer reactions. Human Antioxidant protein (HAH1) contains a tetrahedral Cu(I) bound by four cysteine residues, as shown in Fig. 2-19(b). HAH1 is involved in the transport of Copper in the body and is labeled a chaperone. Amicyanin is a tetrahedral Cu(II)-containing protein which binds two histidines, a methionine, and a cysteine residue, as shown in Fig. 2-19(d). This protein is called a blue copper protein due to its spectroscopic properties, which arise from cysteine to Cu(II) charge-transfer [145–150].

Metalloproteins can also contain multiple metals in close proximity. For example, Aminopeptidase from Aeromonas proteolytica (AAP), shown in Fig. 2-20(a), is a di-zinc protein which catalytically cleaves the N-terminal residue of polypeptides. The active site of Aminopeptidase is shown in Fig. 2-20(b), where the zinc ions are bound to histidine and aspartic acid residues and are bridged by a water molecule. Urease from Bacillus pasteurii (Fig. 2-20(c)) is a di-nickel enzyme that catalyzes the hydrolysis of urea to ammonia and carbon dioxide. Its active site is shown in Fig. 2-20(d), where two nonequivalent Ni(II) atoms (3.5 Å separation) are each bound to two histidines and share a bridging carbamylated lysine. An aspartate residue, two waters and a bridging water/hydroxide ion complete the coordination sphere. The geometries of the two Ni centers can be described as square pyramidal and octahedral, respectively [151–154]. Both Aminopeptidase and Urease are homo-nuclear proteins, but hetero-nuclear metalloproteins also exist. Copper-Zinc Superoxide Dismutase, Cu,Zn-SOD, is one such protein; its active site is shown in Fig. 2-21.

(a) Human Carbonic Anhydrase II, HCA II (PDB ID: 1CA2) (b) 1CA2 Active Site (c) Farnesyl Transferase, FTase (PDB ID: 1QBQ) (d) 1QBQ Active Site

Figure 2-18. Zinc Metalloproteins. The Zinc ion is shown in purple while the Oxygen atom of the water molecule in 1CA2 bound to Zinc is shown in red.

(a) Human Antioxidant Protein, HAH1 (PDB ID: 1FEE) (b) 1FEE Active Site (c) Copper Amicyanin (PDB ID: 1AAC) (d) 1AAC Active Site

Figure 2-19. Copper Metalloproteins. The Copper ion is shown in grey.

(a) Aminopeptidase (PDB ID: 1AMP) (b) 1AMP Active Site (c) Urease (PDB ID: 2UBP) (d) 2UBP Active Site

Figure 2-20. Homo-Nuclear Metalloproteins. The Zinc and Nickel ions are shown in purple and grey respectively, while Oxygen atoms of water molecules are shown in red.

Figure 2-21. Hetero-Nuclear Metalloproteins. The Active Site of Copper-Zinc Superoxide Dismutase, Cu,Zn-SOD (PDB ID: 1CBJ), is shown. The Zinc and Copper ions are shown in purple and grey respectively.

CHAPTER 3
MODELING TOOL KIT++

3.1 Introduction

In an ideal world where one could use the most accurate theories with infinite computer power and time, it would be possible to design a new drug. However, in reality there is always a compromise between speed and accuracy. Figure 3-1 attempts to illustrate current research efforts, where reality is marked as an “X” and progress is being made, for example, in the levels of theory used in scoring functions for receptor-ligand interaction calculations and in conformational sampling techniques [6]. These advances have mirrored the increases in computing power over recent years.

[Plot with the axes Theory and Time/Money; the target, “Drug”, and the current position, “X”, are marked.]

Figure 3-1. Computational Drug Design. In theory a drug can be designed in silico; however, in reality this has not materialized. Current research efforts have focused on pushing the boundaries of the theories used in the SBDD areas, with mixed results.

With the desire to use QM techniques in SBDD firmly established, there comes a need to develop software where these methods can be used conveniently in the DD process. This has led to the development of a package of C++ libraries called Modeling ToolKit++ (MTK++) to interface with common QM programs in order to test the applicability of, and validate, QM methods in SBDD. MTK++ was designed from the ground up to be used in areas of in silico SBDD such as molecular alignment and receptor-ligand scoring. The use of SE methods in molecular alignment and scoring is further analyzed in Ch. 4. This toolkit was also designed with metalloproteins in mind, for which no such software was known to be available; this is discussed in detail in Ch. 5. All too often molecular modeling software described as “open source” is obfuscated before release, making it almost impossible to read or extend. To combat this, MTK++ was developed as an in-house suite of libraries with a consistent Application Programming Interface (API), which will allow new and novel methods to be developed. This chapter describes the design and development of MTK++. The algorithms used are described in detail with numerous illustrated examples.

3.2 Overview

MTK++ is an object-oriented C++ package of molecular modeling libraries including Molecular Mechanics (MM), Genetic Algorithm (GA), file processing and conversion (Parsers), and statistical and molecular tools used in LBDD, SBDD and other computational chemistry fields. The Basic Linear Algebra Subprograms (BLAS), Linear Algebra PACKage (LAPACK), Boost, and xerces-c [155] libraries were used in the development. At the time of preparation of this thesis MTK++ contained over 30,000 lines of code. Thus, a complete description of the code cannot be given; however, all libraries and their major classes and algorithms are described.

3.2.1 Development

MTK++ is implemented as a C++ package of libraries. C++ was used instead of other programming languages such as FORTRAN and C because it is an object-oriented programming language that enables abstraction, encapsulation and inheritance. C++ provides the Standard Template Library, which has convenience classes such as vectors, maps, and lists, and external libraries such as BLAS and xerces-c are available for matrix-vector math and XML handling, respectively. Also, C++ code can be compiled on nearly any operating system and allows modular programming, which makes changes easy. Furthermore, C++ is largely backwards compatible with C, and the resulting C++ code is very efficient due to its duality as a high-level and low-level language. The development and debugging were done on a 1.33 GHz PowerPC G4 Apple computer running Mac OS X 10.4.8 with 512 MB SDRAM. The gcc (4.0.1) compiler was used. The code is cross-compiler and cross-platform compatible and was tested on both Mac OS X and Linux operating systems.

3.2.2 Library Hierarchy

[Diagram: the Molecule, GA, Graph, Parsers, Utils, Statistics, Minimizers, and MM libraries connected by arrows.]

Figure 3-2. Library Hierarchy as Implemented in MTK++. The library where the tail of an arrow starts uses the library where the head of the arrow ends; e.g. the Parsers library uses the Molecule, Utils, Statistics, and GA libraries.

Figure 3-2 shows the hierarchy of the MTK++ package. At the center of the package of libraries is a group of utility routines which are used in all other packages. These include constant definitions, diagonalization functions, an indexing class for easy sorting of objects, and a class called vector3d for atomic coordinate storage and transformation.1 The Parsers library takes care of the reading and writing of files, and it requires the Molecule and GA libraries. The Molecule library also uses the Graph library for ring perception and other recursive functions. The design of the individual libraries is discussed further in the sections below.

1 The vector3d class was originally developed by Andrew Wollacott [156, 157]. Extensive functionality was added.

3.2.3 Molecule Library

The core of the MTK++ package is the Molecule library, and its most important classes are shown in Fig. 3-3. This library can handle multiple molecules at a time, and these are stored in the collection class. The collection class also takes care of all the elements (this information only needs to be stored once, not for every molecule), and of the parameter and fragment information for MM calculations. The molecules themselves are of type “molecule”, and this class stores submolecule or residue information. This division is analogous to that of amino acids in a protein, nucleotides in DNA, or fragments in a small organic molecule. The submolecule class stores a list of atoms, and the atom class stores pointers to objects such as its element and its coordinates, which are a vector of three double precision numbers (vector3d).

The parameters class stores information for MM calculations as structures, such as atom types; bond, angle, torsion, and improper parameters (force constants, equilibrium values); and non-bonded parameters (charges, Lennard-Jones values), as shown in Fig. 3-4.

The stdLibrary class is the main object which deals with the storage and function of a molecular fragment library, as shown in Fig. 3-5. stdLibrary stores a list of stdGroups, and a stdGroup is a storage container for stdFrags. For example, a stdGroup could store the 20 amino acids of proteins, each a stdFrag, or a list of functional groups in drug design. The stdFrag class contains information about its atoms (stdAtom), bonds (stdBond), features (stdFeature), etc.

The functionality available to molecules such as proteins, DNA, and small organics originates from the molecule class in the Molecule library, as shown in Fig. 3-8. molecule stores lists of bonds (a vector of Bond objects in C++), angles, torsions, and impropers. The connectivity information is determined in the connections class. This class can

[Diagram of classes and structures: metalCenter, stdLibrary, parameters, collection, disulfide, molecule, elements, submolecule, atom, vector3d, element.]

Figure 3-3. Core Class Hierarchy of the Molecule Library as Implemented in MTK++. Solid line boxes denote a class, while a dashed box signifies a structure. A class where the tail of the arrow starts uses or contains a class or structure where the head of the arrow ends; e.g. the elements class contains or uses the element structure.
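The containment relationships of Fig. 3-3 can be sketched in a few lines of C++. The type names below mirror the figure, but every member, method and signature is an assumption made for illustration inside a hypothetical `sketch` namespace; this is not the MTK++ API.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

namespace sketch {  // hypothetical names; this is not the MTK++ API

struct Vector3d { double x, y, z; };

struct Element { std::string symbol; int atomicNumber; };

struct Atom {
  std::string name;
  const Element* element;  // shared entry owned by the Collection
  Vector3d coords;
};

struct Submolecule {  // a residue or fragment
  std::string name;
  std::vector<Atom> atoms;
};

struct Molecule {
  std::string name;
  std::vector<Submolecule> submolecules;

  std::size_t atomCount() const {
    std::size_t n = 0;
    for (const Submolecule& s : submolecules) n += s.atoms.size();
    return n;
  }
};

class Collection {
 public:
  // Element data are stored once and referenced by every atom.
  const Element* addElement(const std::string& symbol, int atomicNumber) {
    assert(atomicNumber > 0);
    elements_.push_back(
        std::unique_ptr<Element>(new Element{symbol, atomicNumber}));
    return elements_.back().get();
  }

  Molecule& addMolecule(const std::string& name) {
    molecules_.push_back(std::unique_ptr<Molecule>(new Molecule()));
    molecules_.back()->name = name;
    return *molecules_.back();
  }

  std::size_t moleculeCount() const { return molecules_.size(); }

 private:
  // unique_ptr keeps addresses stable while the vectors grow.
  std::vector<std::unique_ptr<Element>> elements_;
  std::vector<std::unique_ptr<Molecule>> molecules_;
};

}  // namespace sketch
```

Holding elements and molecules through owning pointers is one way to let atoms keep a plain pointer to their element without it being invalidated when further elements are added; whether MTK++ makes the same choice is not stated in the text.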

[Diagram of the parameters class and its structures: atomType, bondParam, angParam, torParam, impParam, LJ612Param, eqAtoms.]

Figure 3-4. Class Hierarchy of the Parameters Component of the Molecule Class as Implemented in MTK++. Solid line boxes denote a class, while a dashed box signifies a structure. A class where the tail of the arrow starts uses or contains a class or structure where the head of the arrow ends; e.g. the parameters class contains or uses the atomType structure.

[Figure 3-5 diagram; classes and structures shown: stdLibrary, stdGroup, stdFrag, stdAtom, stdBond, stdLoop, stdImproper, stdAlias, stdFeature, stdRing.]

Figure 3-5. Class Hierarchy of the Standard Library Component of the Molecule Class as Implemented in MTK++. Solid line boxes denote a class, while a dashed box signifies a structure. A class where the tail of the arrow starts uses or contains a class or structure where the head of the arrow ends, e.g. the stdFrag class contains or uses the stdAtom structure.

perceive bonds using distance and other geometric information, and can also determine bonds through user defined databases of molecular structures (this is discussed further below in section 3.2.8 and appendix C). For example, the connectivity of an alanine residue in a protein does not need to be perceived, since it is known a priori if the names of the atoms are known.

Disulfide bonds between Cysteine residues of proteins, as shown in Fig. 3-6, are automatically perceived using the parameters in Table 3-1 [156]. If the SG atoms of two Cysteine residues are within dCutoff of each other and $S{-}S_{\mathrm{Energy}}$ from Eq. 3–1 is less than eCutoff, they are considered bonded.

The protonation states of Histidine residues bound to a metal ion are also perceived, using a bond distance cutoff of 2.3 Å. If the HIS@NE2 (epsilon nitrogen of Histidine) atom is within this cutoff, the residue is set to HID type. If the HIS@ND1 atom is within this cutoff, the residue is set to HIE type. If both HIS@NE2 and HIS@ND1 are bonded to a metal atom within 2.3 Å, then the residue is set to HIN type, such as the bridging histidine residue in Copper-Zinc Superoxide Dismutase [158].

Bond order, hybridization and formal charge of atoms for small molecules are determined in the hybridize class, which is discussed in more detail below in section 3.3.
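The histidine rule described above can be illustrated with a short sketch. This is written in Python for brevity and is not the MTK++ C++ code; the function name `histidine_type`, its arguments, and the example coordinates are hypothetical, and only the 2.3 Å cutoff and the HID/HIE/HIN assignments are taken from the text.

```python
import math

METAL_CUTOFF = 2.3  # Angstrom; the bond distance cutoff quoted in the text


def histidine_type(ne2, nd1, metals, cutoff=METAL_CUTOFF):
    """Return 'HID', 'HIE', 'HIN', or None for one histidine residue.

    ne2, nd1: (x, y, z) coordinates of the two ring nitrogens.
    metals:   list of (x, y, z) metal ion coordinates.
    """
    ne2_bound = any(math.dist(ne2, m) < cutoff for m in metals)
    nd1_bound = any(math.dist(nd1, m) < cutoff for m in metals)
    if ne2_bound and nd1_bound:
        return "HIN"  # both nitrogens coordinate a metal (bridging histidine)
    if ne2_bound:
        return "HID"  # metal on NE2, so the proton sits on ND1
    if nd1_bound:
        return "HIE"  # metal on ND1, so the proton sits on NE2
    return None       # no metal contact; the residue type is left unchanged


# Hypothetical coordinates, for illustration only.
NE2 = (0.0, 0.0, 0.0)
ND1 = (2.2, 0.0, 0.0)
ZN_NEAR_NE2 = (0.0, 0.0, 2.0)
CU_NEAR_ND1 = (2.2, 0.0, 2.0)

print(histidine_type(NE2, ND1, [ZN_NEAR_NE2]))               # NE2 bound only
print(histidine_type(NE2, ND1, [ZN_NEAR_NE2, CU_NEAR_ND1]))  # both bound
```

A list of metals is passed rather than a single ion so that the bridging case, where each nitrogen binds a different metal, is covered by the same test.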

[Figure 3-6 diagram; atoms labeled: CB1, SG1, SG2, CB2 of two CYS residues.]

Figure 3-6. Disulfide Bond in Proteins.

Table 3-1. Disulfide Bond Prediction Parameters.

    Parameter    Value
    dCutoff      2.5
    ssBondReq    2.038
    ssBondKeq    166.0
    CBSGSGReq    103.7
    CBSGSGKeq    68.0
    eCutoff      30.0

[Figure 3-7 diagram; panels: (a) HIN, (b) HID, (c) HIE, (d) HIP.]

Figure 3-7. The Structural Types of the Histidine Residue.

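$$S{-}S_{\mathrm{Energy}} = E^{\mathrm{Bond}}_{SG1-SG2} + E^{\mathrm{Angle}}_{CB1-SG1-SG2} + E^{\mathrm{Angle}}_{CB2-SG2-SG1} \tag{3–1}$$

where:

$$E^{\mathrm{Bond}}_{SG-SG} = \mathrm{ssBondKeq} * (\mathrm{distance}_{SG-SG} - \mathrm{ssBondReq})^2 \tag{3–2}$$

$$E^{\mathrm{Angle}}_{CB-SG-SG} = \mathrm{CBSGSGKeq} * (\mathrm{angle}_{CB-SG-SG} - \mathrm{CBSGSGReq})^2 \tag{3–3}$$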
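As an illustration of Eqs. 3–1 to 3–3, the following Python sketch evaluates the disulfide test for one pair of Cysteine residues using the Table 3-1 values. It is not the MTK++ implementation. The function names and example geometry are hypothetical, and the sketch assumes that the angle deviation enters Eq. 3–3 in radians (the usual AMBER convention for angle force constants); the source does not state the units.

```python
import math

# Values from Table 3-1.
PARAMS = {"dCutoff": 2.5, "ssBondReq": 2.038, "ssBondKeq": 166.0,
          "CBSGSGReq": 103.7, "CBSGSGKeq": 68.0, "eCutoff": 30.0}


def angle_deg(a, b, c):
    """Angle a-b-c in degrees."""
    v1 = [a[i] - b[i] for i in range(3)]
    v2 = [c[i] - b[i] for i in range(3)]
    dot = sum(x * y for x, y in zip(v1, v2))
    n1 = math.sqrt(sum(x * x for x in v1))
    n2 = math.sqrt(sum(x * x for x in v2))
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / (n1 * n2)))))


def disulfide_energy(cb1, sg1, sg2, cb2, p=PARAMS):
    """Return (SG-SG distance, S-S energy) following Eqs. 3-1 to 3-3."""
    d = math.dist(sg1, sg2)
    energy = p["ssBondKeq"] * (d - p["ssBondReq"]) ** 2
    for ang in (angle_deg(cb1, sg1, sg2), angle_deg(cb2, sg2, sg1)):
        delta = math.radians(ang - p["CBSGSGReq"])  # assumption: radians
        energy += p["CBSGSGKeq"] * delta ** 2
    return d, energy


def is_disulfide(cb1, sg1, sg2, cb2, p=PARAMS):
    """Bonded if both the distance and the energy are below their cutoffs."""
    d, energy = disulfide_energy(cb1, sg1, sg2, cb2, p)
    return d < p["dCutoff"] and energy < p["eCutoff"]


# Hypothetical ideal geometry: equilibrium S-S distance and CB-SG-SG angles,
# with an arbitrary 1.8 Angstrom CB-SG bond length.
THETA = math.radians(103.7)
SG1 = (0.0, 0.0, 0.0)
SG2 = (2.038, 0.0, 0.0)
CB1 = (1.8 * math.cos(THETA), 1.8 * math.sin(THETA), 0.0)
CB2 = (2.038 - 1.8 * math.cos(THETA), 1.8 * math.sin(THETA), 0.0)

print(is_disulfide(CB1, SG1, SG2, CB2))
```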

Ring moieties are perceived within the rings class and each ring found is stored in a ring structure. The perception of rings is discussed further in section 3.4. The functionalize class determines which functional groups are present in a molecule using the database of fragments defined in appendix C. The implementation details of the functionalize class are outlined in section 3.7.

The fingerprint class contains rudimentary functionality for molecular fingerprinting. A fingerprint is defined as information that describes a molecule in 1-D. The fingerprint in MTK++ is represented as a vector of integers with the following form: “atom info, bond type, # of rings, ring info”. The number of atoms of each element from Hydrogen through Iodine is stored in the first 52 positions; another 52 positions store the number of each of the following bond types: B-H, C-H, N-H, O-H, S-H, B-C, B=C, B-O, B-N, B-O, B-F, B-S, B-Cl, B-Br, B-I, C-C, C=C, C%C, N-N, N=N, C-N, C=N, C%N, N-O, N=O, N-P, N-Se, N=Se, O-O, C-O, C=O, O-Si, O-S, O=S, O-Se, O=Se, C-F, S-S, C-S, C=S, S-N, C-Cl, P-P, P-C, P-O, P=O, P-S, P-Se, Se-Se, C-Se, C=Se, N-Se, where “-”, “=”, and “%” denote a single, double and triple bond respectively. Finally, the 105th position stores the number of rings in the molecule or fragment. The size, planarity, heterocyclic boolean, and the number of nitrogen, oxygen and sulfur atoms of each ring are also stored after the 105th position. Thus the length of the vector depends on the number of rings present in the molecule or fragment. Fingerprinting in MTK++ is primarily used in conjunction with the functionalize class. Fingerprints are used to screen out fragments that could not be a part of a molecule based on the elements, bond types, and rings present, thus speeding up the functionalization of molecules.

A pharmacophore is commonly defined as the three dimensional geometric arrangement of molecular features that are necessary for biological activity. Pharmacophores shared between two molecules are detected using a feature (H-bond Donor/Acceptor, Pi Center, Positive/Negative Center, Hydrophobicity) matching algorithm in the pharmacophore class. The features common to both molecules are stored in a clique structure. A full description of the algorithm implemented in MTK++ is outlined in section 3.8.

The protonate class carries out the addition of Hydrogen atoms to macromolecules (proProtonate), ligands (ligProtonate), and water molecules (watProtonate). proProtonate uses information from user defined libraries to add Hydrogens, while ligProtonate is used when no such library is available. Water molecules often surround structures derived from X-ray crystallographic data, but no Hydrogen atom positions are provided. Hydrogens are added to water molecules separately, after they have been added to macromolecules and ligands. The algorithmic details of the three protonate classes are described in section 3.5.

Conformational searching of drug-like molecules is carried out in the conformers class using graph theory methods. Each conformer of a molecule is stored in a conformer structure. The internal workings of this class are described in section 3.6. An integral part of conformational searching is determining the amount of conformational space sampled. This requires being able to superimpose a conformer onto some reference structure and calculate the root-mean-squared deviation. The superimposition of two molecules is carried out in the superimpose class and is discussed below in section 3.9.

The selection class is used to parse strings that represent subsets of molecular data and is essential in providing an API for users of MTK++. The data structure in the Molecule library has an inherently hierarchical nature. Atom information is stored in submolecule; bonds, angles, torsions, impropers, and submolecules are stored in molecule; and finally all molecules are stored in collection. The atom class is at the bottom of the hierarchy, while collection is at the top. Thus retrieving, for example, all atoms with a specific name in all molecules of the collection requires a certain syntax. The syntax used in the selection class resembles that of a path in a UNIX file system: “/collection/molecule/submolecule/atom”. For example, the string “/COL/MOL/ALA-10/.CA.” selects the atom “.CA.” (alpha carbon) in the alanine with residue number 10 (ALA-10) that is part of the molecule named MOL in the COL collection. The “/” on the left hand side of the string indicates that the selection starts from the top of the structural hierarchy. The selection string “ALA-10/.CA.” does not begin with a slash and represents parsing the hierarchy from the bottom up; all alpha carbons of molecules in the collection with alanine at position 10 will be selected. This syntax can handle molecule, submolecule, and atom names, numbers, or a combination (name-number), such as ALA-10.
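The selection syntax can be illustrated with a small parser. This is a Python sketch of the idea only, not the MTK++ selection class; the function names and the use of `None` as a wildcard for unspecified levels are assumptions made for the example.

```python
LEVELS = ("collection", "molecule", "submolecule", "atom")


def parse_selection(selection):
    """Map a selection string onto the four hierarchy levels.

    A leading '/' anchors the string at the top of the hierarchy; without it
    the tokens are aligned from the bottom (atom) upwards. Levels that are
    not specified are returned as None, which is used here as a wildcard.
    """
    absolute = selection.startswith("/")
    tokens = [t for t in selection.split("/") if t]
    if not tokens or len(tokens) > len(LEVELS):
        raise ValueError("bad selection string: %r" % selection)
    names = LEVELS[:len(tokens)] if absolute else LEVELS[len(LEVELS) - len(tokens):]
    result = dict.fromkeys(LEVELS)
    result.update(zip(names, tokens))
    return result


def split_name_number(token):
    """Split 'ALA-10' into ('ALA', 10); a bare name or number is also accepted."""
    name, sep, number = token.rpartition("-")
    if sep and number.isdigit():
        return name, int(number)
    if token.isdigit():
        return None, int(token)
    return token, None


print(parse_selection("/COL/MOL/ALA-10/.CA."))
print(parse_selection("ALA-10/.CA."))
print(split_name_number("ALA-10"))
```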

The atomTyper class assigns molecular mechanics atom types to the atoms of a molecule using user defined fragment libraries such as those in appendix C. The hydrophobic regions of molecules are determined using an atom additive approach as outlined by Wang and Zhou in 1998 [159].

[Figure 3-8 diagram; classes and structures shown: molecule, Bond, Angle, Torsion, Improper, ring, hybridize, selection, connections, rings, fingerprint, superimpose, pharmacophore, atomTyper, protonate (pro, lig, wat), conformers, conformer, functionalize, clique, stdLibrary.]

Figure 3-8. Class Hierarchy of the Molecule Component of the Molecule Class as Implemented in MTK++. Solid line boxes denote a class, while a dashed box signifies a structure. A class where the tail of the arrow starts uses or contains a class or structure where the head of the arrow ends, e.g. the molecule class contains or uses the hybridize class.

3.2.4 Graph Library

The graph library contains classes as shown in Fig. 3-9 to handle molecular graphs as described in Chapter 2.12. This library is used to find rings and to determine whether graphs are isomorphic. Also it is used to traverse the torsional tree for systematic conformational searching. Tree and graph traversal is carried out using the depth-first

search algorithm. The graph class stores both vertices and edges with the edge struct storing pointers to two vertex objects. Both vertices and edges store a boolean to describe whether each has been visited during a traversal and a numerical variable to describe its color or label. The vertex class also stores a list of its neighbors and which layer it is

placed on.
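A minimal sketch of such a graph with visited flags and a depth-first traversal is given below. It is written in Python for illustration and does not reproduce the MTK++ graph, vertex, or edge classes; the attribute names and the example molecule graph are hypothetical.

```python
class Vertex:
    """A graph vertex with a visited flag, a label, and a neighbor list."""

    def __init__(self, name):
        self.name = name
        self.neighbors = []
        self.visited = False
        self.label = 0


class Graph:
    """Stores vertices and edges; each edge refers to two Vertex objects."""

    def __init__(self):
        self.vertices = {}
        self.edges = []

    def add_edge(self, a, b):
        va = self.vertices.setdefault(a, Vertex(a))
        vb = self.vertices.setdefault(b, Vertex(b))
        va.neighbors.append(vb)
        vb.neighbors.append(va)
        self.edges.append((va, vb))

    def dfs(self, start):
        """Iterative depth-first search; returns vertex names in visit order."""
        for v in self.vertices.values():
            v.visited = False
        order = []
        stack = [self.vertices[start]]
        while stack:
            v = stack.pop()
            if v.visited:
                continue
            v.visited = True
            order.append(v.name)
            # Push neighbors in reverse so they are visited in insertion order.
            stack.extend(reversed(v.neighbors))
        return order


# Hypothetical example: a six-membered ring (1-6) with a substituent (7).
G = Graph()
for a, b in [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 1), (1, 7)]:
    G.add_edge(a, b)
ORDER = G.dfs(1)
print(ORDER)
```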

[Figure 3-9 diagram; classes and structures shown: graph, vertex, edge.]

Figure 3-9. Class Hierarchy of the Graph Library as Implemented in MTK++. Solid line boxes denote a class, while a dashed box signifies a structure. A class where the tail of the arrow starts uses or contains a class or structure where the head of the arrow ends, e.g. the graph class contains or uses the edge structure.

3.2.5 MM Library

The MM library contains classes and functions to carry out Molecular Mechanics minimizations as shown in Fig. 3-10. Currently, the AMBER function is used as described in Chapter 2.3. The amber class contains the driver functions for the lower level classes ambBond, ambAngle, ambTorsion, and ambNonBonded, which contain the AMBER energy/gradient functions. The mmPotential class is the controller for all MM functions that could be developed; it performs all the memory allocation and deallocation. MTK++ was designed to allow its features to be extended easily. For example, cross terms such as bond-stretch and non-harmonic terms such as the Morse potential for bonded atoms are now possible within the MM library. This will become essential when MTK++ is used to study, for example, Blue Copper proteins, which contain a Copper-Sulfur bond that cannot be represented using a harmonic potential.

3.2.6 GA Library

The GA library contains classes and functions to carry out parameter optimization using a genetic algorithm, as shown in Fig. 3-11. It was initially designed to carry out conformational searching of organic molecules; however, its design is application independent. A genetic algorithm is a heuristic method, so it is not guaranteed to find the global minimum of parameter space. Other heuristic methods include the Monte Carlo technique.

[Figure 3-10 diagram; classes shown: mmPotential, amber, ambBond, ambAngle, ambTorsion, ambNonBonded.]

Figure 3-10. Class Hierarchy of the MM Library as Implemented in MTK++. Solid line boxes denote a class, and a class where the tail of the arrow starts uses or contains a class where the head of the arrow ends, e.g. the amber class contains or uses the ambBond class. The orange arrow is used to represent a public inheritance relationship between classes, i.e. amber is of type mmPotential.

The GA library was designed in such a way as to mimic human civilization or evolution. gaWorld is the main class in the library. gaWorld can contain multiple regions (gaReg). Each region contains a population (gaPop) of individuals (gaInd). Each individual contains some chromosomal (gaChr) makeup that in turn is described by genes (gaGene). An individual can contain multiple chromosomes, which in turn can contain multiple genes. For each application of this GA the fitness of each individual must be evaluated. This fitness function is provided by the user to the library. The “survival of the fittest” model is used for individuals to propagate or survive between generations. Individuals can survive from one iterative step to the next through a semi-random selection process biased by their fitness. Reproduction between individuals is carried out using operators such as cross-over, mutation, and averaging. The number of iterations carried out by the GA is either user defined or determined by some convergence criterion.
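The generational loop described above can be sketched as follows. This Python example is illustrative only and does not mirror the gaWorld/gaReg/gaPop classes: an individual is simplified to a single list of genes, truncation selection (keeping the fitter half) stands in for the semi-random fitness-biased selection, and all names and parameter values are assumptions.

```python
import random


def evolve(fitness, n_genes, pop_size=20, generations=50, seed=0):
    """Minimize a user-supplied fitness function with a very small GA.

    Returns the best individual and the best fitness seen at each step.
    n_genes must be at least 2 so that a cross-over point exists.
    """
    rng = random.Random(seed)
    pop = [[rng.uniform(-5.0, 5.0) for _ in range(n_genes)]
           for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=fitness)                # lower value means fitter
        history.append(fitness(pop[0]))
        survivors = pop[:pop_size // 2]      # the fitter half survives
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            cut = rng.randrange(1, n_genes)  # cross-over point
            child = a[:cut] + b[cut:]
            i = rng.randrange(n_genes)       # mutate one gene
            child[i] += rng.gauss(0.0, 0.3)
            children.append(child)
        pop = survivors + children
    pop.sort(key=fitness)
    history.append(fitness(pop[0]))
    return pop[0], history


def sphere(genes):
    """Hypothetical user-supplied fitness function with its minimum at 0."""
    return sum(g * g for g in genes)


BEST, HISTORY = evolve(sphere, n_genes=3)
print(HISTORY[0], HISTORY[-1])
```

Because the survivors are carried over unchanged, the best fitness in this sketch can never get worse from one generation to the next, although, as noted in the text, reaching the global minimum is not guaranteed.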

Regions are treated as being independent; however, the library was implemented in a way that allows “island-hopping” to be developed. This would allow the GA to run in parallel, and during the course of an optimization an individual would, with some probability, travel from one region to another. This would make the genetic information more diverse and prevent early convergence.

3.2.7 Statistics Library

The Statistics library contains classes to carry out statistical analysis as shown in Fig. 3-12. This library is built on the boost library, in which matrix and vector math is performed very efficiently. The sheet class contains and handles matrix objects. The matrix class is derived from the boost matrix and extends its features by allowing for matrix labeling. The baseStats class performs the basic statistical functions as described in Ch. 2.13. The ols class carries out Multiple Linear Regression to calculate Pearson’s correlation coefficient. The pca class carries out Principal Component Analysis (PCA) and the pls class performs Partial Least Squares (PLS) modeling using the kernel algorithm, with leave-N-out cross validation implemented. The PLS algorithm produces a number of matrices during execution, and so the sheet and matrix classes were essential.

3.2.8 Molecular Fragment Library

A fragment library was developed for applications including functional group recognition, molecular alignment and fragmentation of drug-like molecules for the SE-COMBINE approach. The library currently contains over 300 fragments. Fragment names, internal codes and 2-D structural pictures can be found in Appendix C. This is a highly extendable library, with all the tools required to add fragments available within MTK++. The fragments are built using the methodology developed to construct residues for the AMBER suite of programs. This approach uses atom names and types from the Generalized AMBER FF (GAFF) [160] and HF/6-31G* Merz-Kollman/RESP charges as described in Ch. 2.9. The use of RESP easily allowed for symmetric atomic charges, such as on the oxygen atoms in a carboxylate group, and for fragments to carry an integer charge.

[Figure 3-11 diagram; classes shown: gaWorld, gaReg, gaPop, gaInd, gaChr, gaGene, gaOutput, gaSelection, gaGaussian, gaOperators, gaCrossOver, gaAverage, gaMutate.]

Figure 3-11. Class Hierarchy of the GA Library as Implemented in MTK++. Solid line boxes denote a class, and a class where the tail of the arrow starts uses or contains a class where the head of the arrow ends, e.g. the gaWorld class contains or uses the gaReg structure.

[Figure 3-12 diagram; classes shown: baseStats, ols, pca, pls, table, sheet, boost.]

Figure 3-12. Class Hierarchy of the Statistics Library as Implemented in MTK++. Solid line boxes denote a class. A class where the tail of the arrow starts uses or contains a class or structure where the head of the arrow ends, e.g. the pls class contains or uses the baseStats class. The blue arrow is used to represent a public inheritance relationship between classes, i.e. pls is of type baseStats.

3.2.9 Parsers Library

The Parsers library contains classes to read and write molecular file types. XYZ, MOL, MOL2, PDB, SD, and ZMAT file formats are supported. All classes inherit from baseParser, as shown in Fig. 3-13. baseParser controls the error handling of all classes in a uniform way. The XML file parsers also inherit xmlConvertors and domErrorHandler from the xerces-c library, which deal with errors in the XML files. Input and output files of programs such as DivCon and Gaussian (both Cartesian and internal coordinates) are handled. The element parser reads the elements.xml file stored in the MTK++ distribution and populates the elements object, which the collection class stores. For each element in the periodic table the following information is stored: atomic number, mass, group, period, valence, full shell size, covalent radius, van der Waals radius, Pauling’s electronegativity value, and which semiempirical Hamiltonians are available for the element. The stdLib parser handles the library XML files described in the previous section and populates the stdLibrary and stdGroup classes. The param parser handles the parameter files associated with the fragment library, such as parm94 and GAFF from AMBER. The GA parser handles the files associated with the GA library of MTK++. The amber parser can export and import AMBER topology and coordinate files.

[Figure 3-13 diagram; classes shown: baseParser, DivCon, gaussian, sd, xyz, zmat, pdb, mol2, mol, amber, stats, element, param, stdLib, ga, xmlConvertors/domErrorHandler.]

Figure 3-13. Class Hierarchy of the Parsers Library as Implemented in MTK++. Solid line boxes denote a class, and a class where the tail of the arrow starts uses or contains a class where the head of the arrow ends, e.g. the element class contains or uses the xmlConvertors class. The blue arrow is used to represent a public inheritance relationship between classes, i.e. pdb is of type baseParser.

3.3 Hybridization, Bond Order and Formal Charge Perception

It is often the case in SBDD that the design process starts with an X-ray crystal structure of a target molecule in complex with some bound substrate. This poses the challenge of determining the atomic hybridizations, formal charges and bond orders of the small molecule, because no Hydrogen atoms are present. Numerous algorithms have been published [161–165], but the algorithm by Labute from 2005 [142] for perceiving atom hybridizations, bond orders and formal charges of drug-like molecules was implemented in MTK++, as it was shown to be superior to the others. Other methods to perceive atom types and bond information include antechamber by Wang et al. [166].

The Labute algorithm takes ten steps to determine the atom hybridizations, formal charges, and bond orders. A ligand that binds to PPARγ (PDB: 1FM9), shown in Fig. 3-14(a), is used to illustrate the algorithm, where $x_1, ..., x_n$ denote the 3D coordinates of $n$ atoms with atomic numbers $Z_1, ..., Z_n$, the number of bonded atoms for each atom is $Q_i$, and $r_{ij} = |x_i - x_j|$ is the distance between two atoms.

Bonds are perceived by first producing a candidate list and then refining it using geometry. Covalent radii, $R_i$, from Meng [161], shown in Table 3-2, are used in Eq. 3–4 to determine the candidate bond list. Then for each atom, $i$, a “dimension”, $d_i$, is assigned based on a principal component analysis of the Gram matrix, $D$, defined in Eq. 3–5, where $i$ is the current atom index, $k$ is the number of bonded atoms and $\bar{q}$ is the geometric center as shown in Eq. 3–6.

Table 3-2. Meng Atomic Covalent Radii.

    Atom  Radii    Atom  Radii    Atom  Radii    Atom  Radii
    H     0.23     P     1.05     Ni    1.5      Nb    1.48
    He    1.5      S     1.02     Cu    1.52     Mo    1.47
    Li    0.68     Cl    0.99     Zn    1.45     Tc    1.35
    Be    0.35     Ar    1.51     Ga    1.22     Ru    1.4
    B     0.83     K     1.33     Ge    1.17     Rh    1.45
    C     0.68     Ca    0.99     As    1.21     Pd    1.5
    N     0.68     Sc    1.44     Se    1.22     Ag    1.59
    O     0.68     Ti    1.47     Br    1.21     Cd    1.69
    F     0.64     V     1.33     Kr    1.5      In    1.63
    Ne    1.5      Cr    1.35     Rb    1.47     Sn    1.46
    Na    0.97     Mn    1.35     Sr    1.12     Sb    1.46
    Mg    1.1      Fe    1.34     Y     1.78     Te    1.47
    Al    1.35     Co    1.33     Zr    1.56     I     1.4
    Si    1.2

0.1

(e.g. tetrahedral or sp3d). The $d_i$ numerical values for 1FM9 are shown in Fig. 3-14(b).

Following the assignment of $d_i$, an upper bound, $B_i$, for the number of bonds allowed by an atom is determined using $d_i$ and $Z_i$, as shown in Table 3-3. Only the shortest $B_i$ bonds are retained. At this point all atom hybridizations and bond orders are set to zero or undefined.

Table 3-3. Labute Algorithm Upper Bound Bond Conditions.

    Bi    Condition
    0     di = 0
    1     Zi < 3 (H, He)
    2     di = 1, Zi > 2 (sp hybridized and linear)
    3     di = 2, Zi < 11 (sp2 hybridized for 2nd row elements)
    4     di = 2, Zi > 10 or di = 3, Zi < 11 (square planar or sp3 hybridized)
    7     otherwise

The next phase assigns obvious hybridizations based on $d$, $Z$, and $Q$. Each row of Table 3-4 is carried out one at a time, with each row only being applied to atoms with unassigned hybridization, resulting in Fig. 3-14(c). After this step, the only atoms with unassigned hybridizations have $d < 3$, $Z$ = (C, N, O, Si, P, S, Se), $Q < 4$ and at least one bonded neighbor with an unassigned hybridization. At this stage all bond orders, $b_{ij}$, in which atom $i$ or $j$ has non-zero hybridization are set to 1.

A dihedral test is used to identify bonds of order 1. The smallest out-of-plane dihedral is computed using $\min_{j,k}\{|P_{ijkl}|, |\pi - P_{ijkl}|, |-\pi - P_{ijkl}|\}$. If this dihedral is greater than 15°, then $b_{ij}$ is set to 1. Results of this step are shown in Fig. 3-14(d).

Table 3-4. Labute Algorithm Atom Hybridization Assignment.

    Hybridization   Condition
    sp3             Zi = 1, 2
    sp3d            Qi > 4, Zi = (Group 5) and Qi = 5, Zi = Group 4,5,6,7,8
    sp3d2           Qi > 4, Zi = (Group 6) and Qi = 6, Zi = Group 4,5,6,7,8
    sp3d3           Qi > 4, Zi = (Group 7) and Qi = 7, Zi = Group 4,5,6,7,8
    sp3d2           Qi = 4, Zi > 10, di = 2
    sp3d2           Zi = (Transition Metal)
    sp3d2           Qi > 4, Zi > 10 and not Si, P, S, Se
    sp3             Qi > 4, Zi > 10 and Si, P, S, Se
    sp3             (Qi = 4) and (Qi = 3, di = 3)
    sp3             Qi > 2, Zi = Group 6,7,8
    sp3             Zi not (C, N, O, Si, P, S, Se)
    sp3             All atoms where none of its bonded atoms have zero hybridization

The lower bound single bond lengths in Table 3-5 and the condition $|x_i - x_j| > L_{ij} - 0.05$, where $L_{ij}$ is the reference bond length, are used to identify single bonds. The bonds identified using this step are shown in Fig. 3-14(e).

Table 3-5. Labute Algorithm Lower Bound Single Bond Lengths.

    Bond    Dist    Bond    Dist
    C-C     1.54    C-N     1.47
    C-O     1.43    C-Si    1.86
    C-P     1.85    C-S     1.75
    C-Se    1.97    N-N     1.45
    N-O     1.43    N-Si    1.75
    N-P     1.68    N-S     1.76
    N-Se    1.85    O-O     1.47
    O-Si    1.63    O-P     1.57
    O-S     1.57    O-Se    1.97
    Si-Si   2.36    Si-P    2.26
    Si-S    2.15    Si-Se   2.42
    P-P     2.26    P-S     2.07
    P-Se    2.27    S-S     2.05
    S-Se    2.19    Se-Se   2.34

After steps 5 and 6, the hybridizations of all uncharacterized atoms not involved in a bond of unknown order are set to sp3, as shown in Fig. 3-14(f).

A molecular graph is formed including only atoms (vertices) that have undefined hybridization and bonds (edges) that have unknown order. This graph is then divided into components or subgraphs. Each subgraph is analyzed independently and bond orders are assigned as shown in Fig. 3-14(g). Edge weights are assigned with the equation

$$w_{ij} = u_i + u_j + 2\delta(r_{ij} < L_{ij} - 0.11) + \delta(r_{ij} < L_{ij} - 0.25)$$

using the atom parameters, $u$, defined in Table 3-6 (3rd and 4th row elements are mapped to the corresponding 2nd row element with 0.1 subtracted; -20.0 is used for all other atoms). Results are shown in Fig. 3-14(h).

Table 3-6. Labute Algorithm Bond Weights.

    Atom     Q=1     Q=2     Q=3
    C-O      1.3     4.0     4.0
    C-N     -6.9     4.0     4.0
    C        0.0     4.0     4.0
    N-C-O   -2.4    -0.8    -7.0
    N-C-N   -1.4     1.3    -3.0
    N        1.2     1.2     0.0
    O-C-O    4.2    -8.1   -20.0
    O-C-O    4.2    -8.1   -20.0
    O        0.2    -6.5   -20.0
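The edge weight expression can be written as a small function. The Python sketch below is illustrative and not the MTK++ code; the function name is hypothetical, the atom parameters are passed in directly rather than looked up, and the example distances are made-up values chosen to hit each branch. The parameter 4.0 is the Table 3-6 entry for a carbon with Q=2, and 1.54 is the C-C reference length from Table 3-5.

```python
def edge_weight(u_i, u_j, r_ij, l_ij):
    """Weight of a candidate multiple bond between atoms i and j.

    u_i, u_j: atom parameters from Table 3-6.
    r_ij:     observed bond length.
    l_ij:     reference single bond length from Table 3-5.
    Each delta term contributes only when its length condition holds.
    """
    weight = u_i + u_j
    if r_ij < l_ij - 0.11:
        weight += 2.0
    if r_ij < l_ij - 0.25:
        weight += 1.0
    return weight


U_CARBON_Q2 = 4.0   # Table 3-6, row "C", column Q=2
L_CC = 1.54         # Table 3-5, C-C

print(edge_weight(U_CARBON_Q2, U_CARBON_Q2, 1.50, L_CC))  # neither condition met
print(edge_weight(U_CARBON_Q2, U_CARBON_Q2, 1.39, L_CC))  # first condition met
print(edge_weight(U_CARBON_Q2, U_CARBON_Q2, 1.20, L_CC))  # both conditions met
```

Shorter bonds therefore receive larger weights, which makes them more likely to be selected as double or triple bonds.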

A Maximum Weighted Matching Algorithm, as described in Chapter 2.12, is employed to find the best arrangement of double/triple bonds in each subgraph, resulting in the pattern shown in Fig. 3-14(i).

Ionization states and formal charges are perceived from the connectivity and bond order. The formal charge of atom $i$, $f_i$, is calculated as $f_i = c_i - o_i + b_i$, where $c_i$ is the atom group in the periodic table, $o_i$ is the nominal octet (2 for Hydrogen, 6 for Boron, 8 for Carbon and all other sp3 atoms in groups 5, 6, 7, 8) and $b_i$ is the sum of the atom bond orders. The final stage of the algorithm determines the correct bonding and charge state for the following functional groups: nitro, aliphatic amines, carboxylic acids, sulfonic acids, phosphonic acids, amidines, guanidines, and sulfonamides.

3.4 Ring Perception

The algorithm used is in close agreement with that published by Fan, Panaye, Doucet, and Barbu in 1993 [167]. The functions contained in rings determine the smallest set of smallest rings (SSSR) from a molecular graph. The SSSR of a molecule is represented

[Figure 3-14 diagram; panels: (a) START, (b) Step 1, (c) Step 3, (d) Step 5, (e) Step 6, (f) Step 7, (g) Step 8a, (h) Step 8b, (i) Step 8c, (j) END.]

Figure 3-14. Hybridization, Bond Order, and Formal Charge Perception Using the Labute Algorithm.

as $S(m_1, m_2, ...)$ where $m_i$ are the ring sizes. Take for example the molecule shown in Fig. 3-15(a). All open acyclic nodes, highlighted in Fig. 3-15(b), are removed, resulting in the structure shown in Fig. 3-15(c). Then all closed acyclic nodes are removed, as highlighted in Figs. 3-15(d) and 3-15(e). The structure is then separated into blocks as shown in Fig. 3-15(f). The question then arises: how many rings are there in the current block, shown in Fig. 3-15(g)? Allowing the first node to be the root node, numerous ring systems can be found, including $R_V^1 = \{v_1, v_2, v_3, v_{15}, v_{13}, v_{14}\}$, $R_V^2 = \{v_1, v_2, v_3, v_{15}, v_{16}, v_{10}, v_{11}, v_{12}, v_{13}, v_{14}\}$, ..., $R_V^n$. The closed path found containing the root node is recursively searched until it cannot be reduced further, in other words until $R_V^1$ is found, as shown in Fig. 3-16. Once an irreducible closed path is found, all nodes with two links are removed; here nodes 2, 1, and 14 are removed. The algorithm then picks another root node and finds the next ring, and this is repeated until all rings are found.

Once all rings are found in a molecule, an aromaticity test is applied. The algorithm used is in close agreement with that published by Roos-Kozel and Jorgensen in 1981 [168]. Rings are classified as aromatic (AR), antiaromatic (AA), or nonaromatic (NA). A ring is assigned to be nonaromatic if it contains no intra-ring double bonds, contains a quaternary atom, contains more than one saturated carbon, contains a monoradical, or contains a sulfoxide or sulfone. A ring system is aromatic if and only if it contains 4n+2 (n = 0, 1, 2, 3, 4, ...) pi electrons (Hückel rule) and is planar (10° tolerance). The number of pi electrons of a ring is determined using the following rules: cationic carbon and boron contribute 0, saturated heteroatoms contribute 2, an anionic carbon contributes 2, and atoms on an intra-ring pi bond contribute 1. If a ring contains exocyclic pi bond(s) (Carbon double bonded to a heteroatom), then 1 pi electron is removed. Some rings correctly perceived by this algorithm are shown in Fig. 3-17. All are perceived to be aromatic except for cyclooctatetraene (COT). COT contains alternating single and double bonds but it is non-planar and is correctly assigned to be antiaromatic.
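The pi electron counting rules and the Hückel test can be sketched as below. This Python example is illustrative only: the atom category names are hypothetical labels for the cases listed in the text, the nonaromatic pre-filters and the exocyclic pi bond correction are left out, and planarity is passed in as a boolean rather than computed.

```python
# Pi electrons contributed by each kind of ring atom, following the text.
PI_CONTRIBUTION = {
    "cationic_carbon": 0,
    "boron": 0,
    "saturated_heteroatom": 2,
    "anionic_carbon": 2,
    "pi_bond_atom": 1,   # atom on an intra-ring pi bond
}


def count_pi_electrons(atom_kinds):
    """Sum the pi electron contributions of the ring atoms."""
    return sum(PI_CONTRIBUTION[kind] for kind in atom_kinds)


def satisfies_huckel(n_pi):
    """True if n_pi equals 4n+2 for some n = 0, 1, 2, ..."""
    return n_pi >= 2 and (n_pi - 2) % 4 == 0


def is_aromatic(atom_kinds, planar):
    """Aromatic only if the ring is planar and has 4n+2 pi electrons."""
    return planar and satisfies_huckel(count_pi_electrons(atom_kinds))


BENZENE = ["pi_bond_atom"] * 6
CYCLOPENTADIENYL_ANION = ["pi_bond_atom"] * 4 + ["anionic_carbon"]
THIOPHENE = ["pi_bond_atom"] * 4 + ["saturated_heteroatom"]
CYCLOOCTATETRAENE = ["pi_bond_atom"] * 8

print(count_pi_electrons(THIOPHENE), is_aromatic(THIOPHENE, planar=True))
print(count_pi_electrons(CYCLOOCTATETRAENE),
      is_aromatic(CYCLOOCTATETRAENE, planar=False))
```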

[Figure 3-15 diagram; panels: (a) Step 1, (b) Step 2, (c) Step 3, (d) Step 4, (e) Step 5, (f) Step 6, (g) Step 7 (block with nodes numbered 1 to 16).]

Figure 3-15. Ring Perception.

[Figure 3-16 diagram; repeated drawings of the block with nodes numbered 1 to 16.]

Figure 3-16. Ring Perception Step 8.

The ring centroid, plane and normal are also calculated for use in pharmacophore matching and molecular alignment, which will be discussed later in this chapter. The centroid is calculated using Eq. 3–7, where $k$ is the size of the ring and $q_i$ are the atomic coordinates. The ring plane and normal are computed by carrying out the principal component analysis of the Gram matrix, as previously described in section 3.3. Matrix $D$ is evaluated using Eq. 3–8 and then diagonalized, with the first two eigenvectors defining the ring plane and the third being the ring normal.

$$c = \frac{1}{k} \sum_{i=1}^{k} q_i \tag{3–7}$$

$$D = \sum_{i=1}^{k} (q_i - c)(q_i - c)^T \tag{3–8}$$
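Eqs. 3–7 and 3–8 can be sketched in a few lines. The Python example below is illustrative and not the MTK++ code. To stay dependency-free it does not perform a full diagonalization: instead it finds only the eigenvector of $D$ with the smallest eigenvalue (the ring normal) by power iteration on $\mathrm{tr}(D) I - D$, which is a swapped-in technique. All function names and the hexagon test geometry are assumptions.

```python
import math


def ring_centroid(coords):
    """Eq. 3-7: mean of the ring atom coordinates."""
    k = len(coords)
    return [sum(q[a] for q in coords) / k for a in range(3)]


def gram_matrix(coords, c):
    """Eq. 3-8: sum of outer products of the centered coordinates."""
    D = [[0.0] * 3 for _ in range(3)]
    for q in coords:
        d = [q[a] - c[a] for a in range(3)]
        for a in range(3):
            for b in range(3):
                D[a][b] += d[a] * d[b]
    return D


def ring_normal(coords, iterations=200):
    """Unit eigenvector of D with the smallest eigenvalue.

    Power iteration on M = tr(D) I - D converges to the eigenvector of D
    with the smallest eigenvalue, provided the start vector is not
    orthogonal to it and the atoms are not all at the same point.
    """
    c = ring_centroid(coords)
    D = gram_matrix(coords, c)
    trace = D[0][0] + D[1][1] + D[2][2]
    M = [[(trace if a == b else 0.0) - D[a][b] for b in range(3)]
         for a in range(3)]
    v = [1.0, 1.0, 1.0]
    for _ in range(iterations):
        w = [sum(M[a][b] * v[b] for b in range(3)) for a in range(3)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


# Hypothetical test ring: a regular hexagon of radius 1.39 in the xy-plane.
HEXAGON = [(1.39 * math.cos(i * math.pi / 3), 1.39 * math.sin(i * math.pi / 3), 0.0)
           for i in range(6)]
CENTROID = ring_centroid(HEXAGON)
NORMAL = ring_normal(HEXAGON)
print(CENTROID, NORMAL)
```

For a planar ring the normal is the direction of zero variance, so for the hexagon above it lies along the z axis (up to sign).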

3.5 Addition of Hydrogen Atoms to Molecules

The addition of Hydrogen atoms to proteins, DNA or water molecules is carried out using a predefined library. Small molecules with known atom hybridization, ring information and bond orders are dealt with using the following algorithm. First, Hydrogen atoms are added to polar (N, O, and S) atoms, followed by ring systems, then terminal atoms. All other unprotonated atoms are dealt with at the end. The number of Hydrogen atoms to add is defined by the current valence and the ideal full shell value of the atom to which the Hydrogen will be added.

The bond lengths used are defined in Table 3-7. The only distinction made for the atom to which a Hydrogen is added is the element type; in other words, the bond distance from an sp3 or sp2 Carbon to Hydrogen is the same. The angle at which a Hydrogen atom is added is defined either by the hybridization or type of the connecting atom, or by the type of bond between the connecting atom and the 1–3 atom, as shown in Table 3-8.

[Figure 3-17 diagram; panels: (a) Benzene (AR), (b) Anthracene (AR), (c) Cycloheptatriene (AR), (d) Cyclopentadienyl Anion (AR), (e) Pyridine (AR), (f) Thiophene (AR), (g) (AR), (h) Purine (AR), (i) 2-thioxo-2,3-dihydropyrimidin-4-one (AR), (j) imidazo-pyridine-3-thione (AR), (k) Cyclooctatetraene (AA).]

Figure 3-17. Ring structures that are correctly assigned as aromatic (AR), non-aromatic (NA), or anti-aromatic (AA).

The dihedral angle used to add a Hydrogen atom is the most complex component of this algorithm. Suppose a proton, A, is to be added to atom B, which is bonded to atom C and is 1–3 bonded to atom D. First, a list is compiled of all torsional angles XBCD already occupied, where X is any heavy atom bonded to atom B. The dihedral then used is defined by the hybridization of atom B and built using atoms B, C, and D. If B is sp3 hybridized, then a Hydrogen atom is placed 120° from the other bonded atoms. Dihedral angles of 0° and 180° are used when B is sp2 hybridized, and 180° when it is sp hybridized. Aromatic rings are a special case of sp2 hybridized atoms where only a torsion of 180° is allowed.

The dihedral values of polar Hydrogens are optimized to maximize intra-molecular Hydrogen bonding using Eq. 3–9, where $\theta_{D-H-A}$ is the angle between the donor, Hydrogen and acceptor atoms and $r_{H-A}$ is the Hydrogen-acceptor distance. If $\theta_{D-A-AA}$ or $\theta_{D-H-A}$ is less than 90°, or $r_{D-A}$ is greater than 3.5 Å, then no Hydrogen bond is considered.

$$E_{HB} = \cos^2(\theta_{D-H-A}) * e^{-(r_{H-A} - 2.0)^2} \tag{3–9}$$
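Eq. 3–9 can be evaluated directly. The Python sketch below is illustrative and not the MTK++ code; it omits the angle and distance cutoffs, the function name is hypothetical, and the candidate (angle, distance) pairs stand in for the geometries produced by trial dihedral values of a polar Hydrogen.

```python
import math


def hydrogen_bond_energy(theta_dha_deg, r_ha):
    """Eq. 3-9: score for a donor-H...acceptor arrangement.

    theta_dha_deg: donor-Hydrogen-acceptor angle in degrees.
    r_ha:          Hydrogen-acceptor distance in Angstrom.
    The score is largest (1.0) for a linear arrangement at 2.0 Angstrom.
    """
    angular = math.cos(math.radians(theta_dha_deg)) ** 2
    radial = math.exp(-((r_ha - 2.0) ** 2))
    return angular * radial


# Hypothetical geometries from three trial dihedral values.
CANDIDATES = [(120.0, 2.4), (165.0, 2.0), (100.0, 1.9)]
BEST = max(CANDIDATES, key=lambda c: hydrogen_bond_energy(*c))
print(BEST, hydrogen_bond_energy(*BEST))
```

Since the text describes the dihedral as being optimized to maximize Hydrogen bonding, the sketch selects the candidate with the largest score.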

[Figure 3-18 diagram; atoms labeled: D (donor), H (Hydrogen), A (acceptor), AA.]

Figure 3-18. Hydrogen Bond.

Table 3-7. Hydrogen Bond Lengths.

    Atom       Bond Length (Å)
    C          1.09
    N          1.008
    O          0.95
    S          1.008
    Se         1.10
    Default    1.05

Table 3-8. Hydrogen Bond Angles.

    Atom / Bond       Angle (Degrees)
    sp / triple       180.0
    sp2 / double      120.0
    sp3 / single      109.47
    Aromatic Ring     (360 − ((ringSize − 2) ∗ 180)/ringSize)/2
    Default           109.47
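The lookups in Tables 3-7 and 3-8 can be illustrated as follows. This Python sketch is not the MTK++ code; the function and dictionary names are hypothetical. The aromatic ring expression places the Hydrogen so that it bisects the exterior angle of a regular ring of the given size.

```python
# Table 3-7: bond length to Hydrogen by element (Angstrom).
H_BOND_LENGTH = {"C": 1.09, "N": 1.008, "O": 0.95, "S": 1.008, "Se": 1.10}
DEFAULT_H_BOND_LENGTH = 1.05

# Table 3-8: angle by hybridization of the connecting atom (degrees).
H_ANGLE = {"sp": 180.0, "sp2": 120.0, "sp3": 109.47}
DEFAULT_H_ANGLE = 109.47


def h_bond_length(element):
    """Bond length for a Hydrogen attached to the given element."""
    return H_BOND_LENGTH.get(element, DEFAULT_H_BOND_LENGTH)


def aromatic_h_angle(ring_size):
    """Table 3-8 aromatic ring entry.

    The interior angle of a regular ring is (ringSize - 2) * 180 / ringSize;
    the Hydrogen splits the remaining angle around the atom equally.
    """
    interior = (ring_size - 2) * 180.0 / ring_size
    return (360.0 - interior) / 2.0


def h_angle(hybridization, ring_size=None):
    """Angle for a new Hydrogen; ring_size is given for aromatic ring atoms."""
    if ring_size is not None:
        return aromatic_h_angle(ring_size)
    return H_ANGLE.get(hybridization, DEFAULT_H_ANGLE)


print(h_angle("sp2", ring_size=6), h_angle("sp2", ring_size=5), h_angle("sp3"))
```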

3.6 Conformational Sampling

Table 3-9. Hydrogen Bond Dihedrals.

    Atom             Dihedral (Degrees)
    sp               180.0
    sp2              0.0, 180.0
    sp3              120.0
    Aromatic Ring    180.0

Conformational searching of drug-like molecules in MTK++ is carried out using a systematic approach. GAFF [160] atom types are assigned to the atoms in a particular molecule using ANTECHAMBER, and CM2 charges are generated using the DivCon program. The atomic hybridizations and bond orders defined in the hybridize class are used to mark which single or double bonds are rotatable. If either of the atoms in a bond is described as terminal, then the bond is removed from the list of rotatable bonds. If both of the atoms are members of a ring, then the bond is also removed, thus removing ring flexibility. The incorporation of ring flexibility is planned for later releases of the MTK++ code. Then, for each rotatable bond that remains, a torsion is sought.

The total number of molecular conformations, $N_{conformers}$, is then defined by Eq. 3–10, where $i$ is a rotatable bond index, $R$ is the range of the associated torsion (0–360°), and $\delta$ is the rotation increment (120° for sp3–sp3). The increments currently used are tabulated in Table 3-10. Once the number and location of each rotatable bond is determined, as shown in Fig. 3-19, a graph is formed as described in Fig. 3-20 [169]. Each rotatable bond is defined as a layer, with each unique torsional value contained in a vertex on this layer. Graph edges are then defined between each vertex of one layer and every vertex one layer below. Once formed, the graph is traversed and the AMBER energy, $E_{MM}$, is calculated for each conformer. The lowest energy conformers are stored, based on some user provided criteria, for later use.

$$N_{conformers} = \prod_{i=1}^{n} \frac{R_i}{\delta_i}$$ (3–10)

Table 3-10. Dihedral Angles Available based on Bond Type.

Bond Type    Angles
sp3–sp3      60, 180, 300
sp2–sp2      0, 30, 150, 180, 210, 330
sp2–sp3      0, 30, 60, 90, 120, 150, 180, 210, 240, 270, 300, 330
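The layered graph can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary `TORSION_ANGLES` copies Table 3-10, the function names are hypothetical, and the count is taken as the number of tabulated values per bond (the sp2–sp2 values are not evenly spaced, so a single increment δ does not describe them).

```python
from itertools import product
from math import prod

# Torsion values per bond type, copied from Table 3-10.
TORSION_ANGLES = {
    "sp3-sp3": [60, 180, 300],
    "sp2-sp2": [0, 30, 150, 180, 210, 330],
    "sp2-sp3": [0, 30, 60, 90, 120, 150, 180, 210, 240, 270, 300, 330],
}

def count_conformers(bond_types):
    """Number of conformers: product of the vertex count of every layer."""
    return prod(len(TORSION_ANGLES[b]) for b in bond_types)

def enumerate_conformers(bond_types):
    """Traverse the layered graph: one torsion value is taken from each layer."""
    layers = [TORSION_ANGLES[b] for b in bond_types]
    return product(*layers)

# The three-torsion example of Fig. 3-20.
bonds = ["sp3-sp3", "sp2-sp3", "sp2-sp2"]
n_conformers = count_conformers(bonds)
first_conformer = next(enumerate_conformers(bonds))
```

For the three torsions of Fig. 3-20 this gives 3 × 12 × 6 = 216 conformers. In a real search each tuple of torsion values would be applied to the structure and scored with the MM energy; that step is omitted here.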

Figure 3-19. Rotatable Bond Types.

Figure 3-20. Systematic Conformational Searching. Torsion 1 (sp3–sp3: 60, 180, 300) forms the first layer containing three values or vertices, followed by a layer containing 12 vertices (Torsion 2, sp3–sp2: 0 to 330 in steps of 30), and finally the third layer with six vertices (Torsion 3, sp2–sp2: 0, 30, 150, 180, 210, 330). This graph would result in 216 conformers being formed.

Fig. 3-21(a) shows an organic molecule which binds to the Peroxisome Proliferator-Activated Receptor γ. This is a functional group rich molecule containing phenyl rings, a carboxylate, a heterocycle (2,4,5-oxazole), a ketone, an amine, and an ether moiety. This structure has 12 rotatable bonds as shown in Fig. 3-21(b). Using the torsion resolution definitions in Table 3-10 would lead to 4,353,564,672 conformers! Even on modern computer hardware this number is too large. Taking a closer look at this structure, Fig. 3-21(c), the symmetry of the functional groups becomes apparent. For example, the carboxylate group, shown in green, is C2 symmetric since the negative charge is not solely placed on one of the oxygen atoms, and the phenyl group shown in yellow is also symmetric. Removing these symmetric torsions from the total number results in 544,195,584 conformers. Chemical knowledge of the torsional profile between the phenyl and oxazole groups enclosed in the red oval suggests even fewer available torsions due to conjugation, thus reducing the total number of discrete conformers to 181,398,528. MTK++ attempts to reduce this number even further by recognizing “privileged fragments” with known torsion profiles. The tyrosine-like group enclosed in the blue oval is one such fragment and is stored in the “cores” library of the package. Removing these rotatable bonds results in 419,904 conformers. Fig. 3-21(d) highlights the rotatable bonds: green bonds are unrestricted, blue bonds are restricted by symmetry, and red bonds are frozen. The systematic approach works extremely well for molecules with approximately 12 rotatable bonds or fewer. When the number of potential conformers exceeds two million the searching algorithm reverts to using the GA library of MTK++. During a GA search of conformational space the user is required to provide the maximum number of MM calculations which are allowed. Other searching tools such as MD and MD-LES are recommended for large peptidic molecules that bind certain proteins such as HIV Protease and Endothiapepsin.

Figure 3-21. Conformer Generation. (a) PPARγ Agonist. (b) Number of Rotatable Bonds. (c) Symmetric Regions. (d) Reduced number of Rotatable Bonds.

3.7 Substructure Searching/ Functionalize

To functionalize a molecule involves searching it for chemical substructures. Substructure searching is known as the subgraph isomorphism problem of graph theory and belongs to the class of NP-complete computational problems. Due to the NP-complete nature of substructure searching, a screen is usually carried out first to eliminate subgraphs that cannot be contained in the molecule. The fingerprint class in the molecule library carries out this screening process between fragments and a molecule.
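The idea of a screening step can be sketched as a subset test between feature sets. The sketch below is an assumption-laden illustration: the feature definition (element counts and bonded element pairs) and the names `fingerprint` and `may_contain` are hypothetical and do not describe the actual MTK++ fingerprint class.

```python
def fingerprint(atoms, bonds):
    """Build a simple feature set for a molecular graph.

    atoms: list of element symbols; bonds: list of (i, j) index pairs.
    Features: (element, k) meaning "at least k atoms of this element",
    and sorted element pairs for every bond.
    """
    features = set()
    counts = {}
    for element in atoms:
        counts[element] = counts.get(element, 0) + 1
        features.add((element, counts[element]))
    for i, j in bonds:
        features.add(tuple(sorted((atoms[i], atoms[j]))))
    return features

def may_contain(molecule_fp, fragment_fp):
    """Screen: the fragment can only be present if all its features are."""
    return fragment_fp <= molecule_fp

# Heavy-atom graphs only, for brevity.
carboxyl = fingerprint(["C", "O", "O"], [(0, 1), (0, 2)])
acetic_acid = fingerprint(["C", "C", "O", "O"], [(0, 1), (1, 2), (1, 3)])
ethane = fingerprint(["C", "C"], [(0, 1)])

passes_acid = may_contain(acetic_acid, carboxyl)
passes_ethane = may_contain(ethane, carboxyl)
```

A failed screen rules the fragment out cheaply; a passed screen only means that the more expensive subgraph isomorphism test is worth running.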

The brute-force algorithm for subgraph isomorphism begins by generating the adjacency matrices A and B for the fragment and the molecule containing $P_A$ and $P_B$ atoms respectively. An exhaustive search then involves generating $P_B!/[P_A!(P_B - P_A)!]$ combinations of $P_A$ atoms and determining whether any combination matches a portion of the molecule. The algorithm used is in close agreement with that published by Ullmann

[170], and Willett, Wilson, and Reddaway [171]. Ullmann first noticed that using a depth-first backtracking search dramatically increases efficiency, while Willett used a labeled graph and a non-binary connection table to increase algorithm speed. The functionality of finding substructures in molecules was developed to carry out

functional group alignment of drug-like molecules and to optimize fragment positions in drug-protein complexes during the lead optimization stage of the drug design process. The algorithmic details of the functionalize code are as follows (this example was adapted from Molecular Modelling, Principles and Applications 2nd Edition by Andrew R.

Leach [81]). Take for example the fragment and molecule shown in Fig. 3-22(a) and 3-22(b). The corresponding adjacency matrices are shown in Eq. 3–11 and 3–12. The Ullmann algorithm tries to find the match between the fragment and the molecule (Fig. 3-22(c)). Mathematically this is represented as the matrix A, Eq. 3–13, which satisfies $A(AM)^T = F$ as shown in Eq. 3–14.

Figure 3-22. Ullmann Subgraph Isomorphism Illustration. (a) Fragment Structure. (b) Molecule Structure. (c) Alignment.

$$F = \begin{pmatrix} 0&1&0&0 \\ 1&0&1&0 \\ 0&1&0&1 \\ 0&0&1&0 \end{pmatrix}$$ (3–11)

$$M = \begin{pmatrix} 0&1&0&0&0&0 \\ 1&0&1&1&0&0 \\ 0&1&0&0&0&0 \\ 0&1&0&0&1&1 \\ 0&0&0&1&0&0 \\ 0&0&0&1&0&0 \end{pmatrix}$$ (3–12)

$$A = \begin{pmatrix} 0&0&1&0&0&0 \\ 0&1&0&0&0&0 \\ 0&0&0&1&0&0 \\ 0&0&0&0&0&1 \end{pmatrix}$$ (3–13)

$$A(AM)^T = \begin{pmatrix} 0&0&1&0&0&0 \\ 0&1&0&0&0&0 \\ 0&0&0&1&0&0 \\ 0&0&0&0&0&1 \end{pmatrix} \begin{pmatrix} 0&1&0&0 \\ 1&0&1&0 \\ 0&1&0&0 \\ 0&1&0&1 \\ 0&0&1&0 \\ 0&0&1&0 \end{pmatrix} = \begin{pmatrix} 0&1&0&0 \\ 1&0&1&0 \\ 0&1&0&1 \\ 0&0&1&0 \end{pmatrix} = F$$ (3–14)

This depth-first backtracking algorithm uses a general match matrix, M, that contains all the possible equivalences between atoms from A and B. The elements of this matrix, $m_{ij}$ $(1 \le i \le P_A;\ 1 \le j \le P_B)$, are such that:

$$m_{ij} = \begin{cases} 1 & \text{if the } i\text{th atom of A can be mapped to the } j\text{th atom of B,} \\ 0 & \text{otherwise.} \end{cases}$$ (3–15)

The Ullmann heuristic states that “if a fragment atom $a_i$ has a neighbor x, and a molecule atom $b_j$ can be mapped to $a_i$, then there must exist a neighbor of $b_j$, y, that can be mapped to x” and is mathematically written in Eq. 3–16.

$$m_{ij} = \forall x (1 \ldots P_A) \left[ (a_{ix} = 1) \Rightarrow \exists y (1 \ldots P_B)(m_{xy} b_{jy} = 1) \right]$$ (3–16)

If at any stage during the search there is an atom i in A such that $m_{ij} = 0$ for all atoms j in B, then a mismatch is identified as defined in Eq. 3–17 and the match is discarded.

$$\text{mismatch} = \exists i (1 \ldots P_A) \left[ (m_{ij} = 0)\ \forall j (1 \ldots P_B) \right]$$ (3–17)
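The refinement and backtracking steps can be sketched compactly in Python. This is a minimal illustration rather than the MTK++ implementation: the function names are hypothetical, the initial candidates are chosen by a simple degree test (an assumption), and the labeled-graph extensions attributed to Willett are not included. The example matrices follow Eq. 3–11 to 3–13, assuming atom 4 of the molecule is bonded to atoms 2, 5, and 6, which is what the product in Eq. 3–14 requires.

```python
def matmul(X, Y):
    """Plain-list matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

def refine(m, F, M):
    """Apply the neighbor condition of Eq. 3-16 until nothing changes.

    Returns False when some fragment atom has no candidate left (Eq. 3-17).
    """
    changed = True
    while changed:
        changed = False
        for i in range(len(F)):
            for j in range(len(M)):
                if not m[i][j]:
                    continue
                for x in range(len(F)):
                    if F[i][x] and not any(m[x][y] and M[j][y] for y in range(len(M))):
                        m[i][j] = 0
                        changed = True
                        break
    return all(any(row) for row in m)

def ullmann(F, M):
    """Depth-first backtracking search; returns every match matrix found."""
    pa, pb = len(F), len(M)
    # Initial candidates: a molecule atom needs at least the fragment atom's degree.
    m0 = [[int(sum(F[i]) <= sum(M[j])) for j in range(pb)] for i in range(pa)]
    matches = []

    def search(i, m, used):
        if i == pa:
            matches.append(m)
            return
        for j in range(pb):
            if m[i][j] and j not in used:
                trial = [row[:] for row in m]
                trial[i] = [int(y == j) for y in range(pb)]  # fix atom i onto j
                if refine(trial, F, M):
                    search(i + 1, trial, used | {j})

    if refine(m0, F, M):
        search(0, m0, frozenset())
    return matches

F = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
M = [[0, 1, 0, 0, 0, 0], [1, 0, 1, 1, 0, 0], [0, 1, 0, 0, 0, 0],
     [0, 1, 0, 0, 1, 1], [0, 0, 0, 1, 0, 0], [0, 0, 0, 1, 0, 0]]
A_doc = [[0, 0, 1, 0, 0, 0], [0, 1, 0, 0, 0, 0],
         [0, 0, 0, 1, 0, 0], [0, 0, 0, 0, 0, 1]]

matches = ullmann(F, M)
```

The four-atom chain can be laid onto this molecule in several symmetry-related ways, and the matrix of Eq. 3–13 is one of the matches returned. Each returned matrix can be checked with `matmul(A, transpose(matmul(A, M))) == F`.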

The complete algorithm to perceive the functional groups in a molecule is described using pseudo code in Algorithm 3.1. The algorithm begins by reading user created fragment libraries and the molecules which are to be studied. Fingerprints of each fragment and molecule are then created. For each molecule under consideration rings, atom hybridizations, bond orders and formal charges are assigned using the algorithm previously described. If Hydrogens are not present on the molecule then they are added using the algorithm described earlier in this chapter.

Algorithm 3.1: Functionalize Algorithm.
Data: Fragment Libraries and MDL Files
Result: Functional group assignment
begin
    Read Fragment Libraries;
    Read in molecules to functionalize;
    Generate fingerprints for all fragments;
    for i → nMolecules do
        Determine Rings;
        Perceive Hybridizations, Bond Order, and Formal Charge;
        Add Hydrogens;
        Generate fingerprint;
        for j → nFragments do
            bool bMatch1 = Compare molecule to simple fingerprint (Screening);
            if bMatch1 then
                bool bMatch2 = Match fragment to molecule using the Ullmann and Willett algorithm of subgraph isomorphism;
                if bMatch2 then
                    assign fragment code to molecule;
                end
            end
        end
    end
end

Then, for each fragment stored in memory, its fingerprint is compared to that of the molecule. If there is a fingerprint match then the Ullmann/Willett algorithm is invoked. The subgraph isomorphism algorithm is outlined in Appendix A. If the subgraph isomorphism algorithm results in a match then the fragment code is assigned to the molecule.

3.8 Clique Detection/ Maximum Common Pharmacophore

As outlined in Ch. 2.12 a 3D molecular clique is defined as a group of pharmacophore points and the geometric distances between all points in that group. Fig. 3-23 illustrates the clique detection algorithm implemented in MTK++ [172]. Take for example two estrogen receptor ligands (PDBID: 1ERR and 3ERT). After finding the pharmacophore points, such as Hydrogen bond acceptors/donors, positive/negative charge centers, hydrophobes, rings, and ring normals, as shown in Fig. 3-23(b), certain molecular features can be mapped to one another, as shown in Fig. 3-23(c). The mapping, Eq. 3–18, results in a valid clique because the inter-point distances, Eq. 3–19, are compatible within some tolerance. However, adding the mapping $M = [F_1 \leftrightarrow F_2]$ does not result in a valid clique as $d^1_{CF} \not\approx d^2_{CF}$. The clique detection algorithm thus requires a method of pruning a potentially large set of mappings, which is carried out by allowing each mapping to be a seed and growing cliques using heuristic criteria. Certain seed mappings will lead to equivalent cliques; however, a diverse set is often found. Cliques are then scored or ranked to determine the best overall matching using Eq. 3–20, where D is the set of inter-point distances, $d^1_i$ is a distance between two features of molecule 1, and $d^2_i$ is the distance between the equivalent features in molecule 2. The function for the two mappings $A_1 \leftrightarrow A_2$ and $B_1 \leftrightarrow B_2$ reaches a value of 1.0 when $d^1_{AB} = d^2_{AB}$. The parameter $\Delta d_{max}$ controls how rapidly the match score drops off as the distances become less compatible.

$$P = [A_1 \leftrightarrow A_2,\ B_1 \leftrightarrow B_2,\ C_1 \leftrightarrow C_2,\ D_1 \leftrightarrow D_2,\ E_1 \leftrightarrow E_2]$$ (3–18)

$$D = \left[ d^1_{AB} \approx d^2_{AB},\ d^1_{BC} \approx d^2_{BC},\ d^1_{CD} \approx d^2_{CD},\ d^1_{DE} \approx d^2_{DE},\ \ldots \right]$$ (3–19)

$$\text{Score} = \sum_{i}^{D} \exp\left[ -\left( \frac{d^1_i - d^2_i}{\Delta d_{max}} \right)^2 \right]$$ (3–20)

3.9 Superimposition

Molecular superimposition is carried out using a rigid body least squares procedure from Kearsley [173] and Kabsch [174, 175]. The rotation matrix that minimizes the sum of the squared distances between atoms of two molecules, Eq. 3–21, is solved using quaternions and eigen methods as described by Kearsley in 1989.

$$F = \sum_i \left| x_i - x'_i \right|^2$$ (3–21)

Figure 3-23. Clique Detection Illustration. (a) Estrogen Ligands (PDB: 1ERR and 3ERT). (b) Chemical Features Highlighted. (c) Chemical Feature Mappings.

A requirement of this procedure is that atom i in molecule A corresponds to atom i in molecule B. For example, measuring the RMSD between two benzoic acid conformers, as shown in Figures 3-24(a) and 3-24(b), requires a particular correspondence to remove artificial differences attributed to automorphisms or self-symmetry. This is carried out by generating all matchings of non-Hydrogen atoms by type or element kind and assigning the lowest RMSD as the true value [176].
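The effect of atom correspondence on the RMSD can be illustrated with a brute-force sketch. The assumptions here are loud ones: the two coordinate sets are taken to be already superimposed, atoms are matched by element only, bond connectivity is ignored, and the function names are hypothetical. Enumerating every permutation grows factorially, so this is only suitable for small examples.

```python
import math
from itertools import permutations, product

def rmsd(coords_a, coords_b):
    """Root mean squared deviation between two equally ordered coordinate lists."""
    total = sum((p - q) ** 2 for a, b in zip(coords_a, coords_b) for p, q in zip(a, b))
    return math.sqrt(total / len(coords_a))

def min_rmsd_over_matchings(elements, coords_a, coords_b):
    """Lowest RMSD over all relabelings that swap atoms of the same element."""
    groups = {}
    for idx, element in enumerate(elements):
        groups.setdefault(element, []).append(idx)
    group_indices = list(groups.values())
    best = float("inf")
    for perms in product(*(permutations(g) for g in group_indices)):
        mapping = list(range(len(elements)))
        for group, perm in zip(group_indices, perms):
            for src, dst in zip(group, perm):
                mapping[src] = dst
        permuted_b = [coords_b[mapping[i]] for i in range(len(elements))]
        best = min(best, rmsd(coords_a, permuted_b))
    return best

# A carboxylate-like toy: identical geometry, oxygen labels swapped.
elements = ["C", "O", "O"]
conf_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (-1.0, 0.0, 0.0)]
conf_b = [(0.0, 0.0, 0.0), (-1.0, 0.0, 0.0), (1.0, 0.0, 0.0)]

naive = rmsd(conf_a, conf_b)
symmetry_aware = min_rmsd_over_matchings(elements, conf_a, conf_b)
```

The naive value is artificially large because the two oxygen labels are exchanged, whereas the minimum over matchings recognizes that the two geometries are identical.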

Figure 3-24. Illustration of the requirement of atom correspondence for molecular superposition. (a) Conformer 1. (b) Conformer 2.

3.10 Conclusions

This chapter has outlined the design and development of a C++ package called MTK++. This package contains functions to handle molecular structures ranging from proteins to small molecules, which may be utilized to calculate molecular mechanics energies and gradients, to perceive atom hybridizations, and to evaluate bond orders, formal charges, rings and functional groups. Utilities to add Hydrogen atoms to structures have been developed; this code was created to deal with metalloprotein systems where no other software could satisfactorily do so. MTK++ also has the capability to perform conformational searching of drug molecules using a systematic approach, for which the molecular mechanics code was a prerequisite. There is an obvious limitation to this approach: when the number of rotatable bonds increases the method becomes intractable. Tricks to improve this, such as tree pruning and creating a torsional type library, could be implemented. An algorithm to perform clique detection of molecular features was also implemented to superimpose two molecular species onto one another for use in ligand and receptor based drug design. Additionally, the MTK++ package contains other general purpose libraries for parameter optimization, graph utilities, and statistical methods.

MTK++ was developed with algorithms from the literature, but it is a firm foundation on which to develop new and novel tools for drug design and metalloprotein modeling. The remaining chapters of this thesis would not have been possible without this software. Chapter 4 utilizes MTK++ conformational searching methods to flexibly align over 80 small molecules, which bind various receptors, onto one another. In Chapter 5 MTK++ was used to efficiently model metalloproteins, in particular Zn proteins. Force fields for tetrahedral Zn proteins were generated using MTK++, where previously such work would have been time consuming and prone to error if attempted by hand. The chapters further foster the growth of a bridge between the development and application of code applicable to biological problems.

CHAPTER 4 SEMIFLEXIBLE QUANTUM MECHANICAL ALIGNMENT OF DRUG-LIKE MOLECULES

4.1 Introduction

The placement of drug molecules into the active sites of receptors remains a challenging problem in the drug design field [177]. Docking [4, 178] is the method of choice when a 3D structure of the receptor is available, while template forcing, or superimposition of a structure onto a known active molecule, is used when no such structure is available [179, 180]. More than 20 years have been spent developing the tools necessary to align molecules on top of one another. Most of these methods were conceived for use in ligand-based drug design (LBDD) where no receptor structure is available, such as targets that are membrane proteins. Table 4-1 summarizes all alignment approaches from 1986 to the present, where methods can be distinguished by the treatment of conformational flexibility, the optimization or superposition algorithm used, and the similarity metric between the two structures [181]. Three types of flexibility are encountered in these methods: rigid body alignment, semiflexible alignment, and flexible alignment. Semiflexible alignment describes a technique of performing a conformational search of a ligand independent of the alignment algorithm, while fully flexible alignment tools perform both of these tasks at the same time.

The SE-COMBINE approach introduced in Ch. 2.11 decomposed the interaction energies of a series of inhibitors that bind trypsin [9]. That implementation of SE-COMBINE contained a number of deficiencies, including the neglect of solvent and dispersive effects and of ligand conformational flexibility; from a modeling standpoint it also required the manual placement of inhibitors into the active site, and the structures were fragmented by hand. This chapter describes the Conformationally unlimited Template forced Interaction energy biased Pharmacophore (CuTieP) program, which was developed using MTK++ to flexibly align ligand structures into receptor active sites using a clique detection algorithm to produce trial alignments, which were ranked using a semiempirical (SE) score function. This hypothesis goes against the norm of Docking and molecular alignment. Here we propose using a LBDD method to generate poses for receptor based drug design (RBDD) scoring and 3D receptor QSAR. However, since the SE-COMBINE method currently can only be considered applicable for modeling a series of congeneric compounds and a receptor, this assumption can be considered valid.

Table 4-1. Compound Alignment Literature. This table was adapted from Melani et al. [182] where the alignment methods from 2003 to the present were added. A dash marks a cell that is empty in the original table.

Program Name        Similarity Criteria                           Optimization/Superposition     Mode

Sheridan[183]       distance geometry of pharmacophore            -                              flexible
SEAL[184]           electrostatic and steric                      RFO                            rigid
ASP[185]            MEP (Hodgkin function)                        simplex                        rigid
DISCO[186]          pharmacophore points                          combinatorial search           semiflexible
MSC[187]            physicochemical properties                    BFGS                           semiflexible
TORSEAL[188]        -                                             -                              flexible
COMPASS[189]        surface description                           neural nets                    semiflexible
GASP[190]           intermolecular matching Energy                GA                             flexible
AAA[191]            distance map of pharmacophore points          combinatorial search           flexible
TFIT[192]           inter/intra molecular Energy                  MC and line search             flexible
PLM[193]            surface overlap volume                        SA                             semiflexible
Petitjean[194]      electronic properties                         gradient method                rigid
Grant[195]          vdW Volume (GFs)                              analytic derivatives           rigid
FLEXS[196]          interaction fields (GFs)                      combinatorial search           flexible
McMahon[197]        electrostatic potential (GFs)                 gradient method                rigid
Cossé-Barbi[198]    pattern in 3D Space                           stepwise approach              rigid
MIMIC[199]          steric and electrostatic fields (GFs)         SD or NR                       semiflexible
QUASIMODI[200]      electron density with GFs                     simplex                        rigid
Parretti[201]       steric and electrostatic fields (GFs)         MC                             rigid
Cocchi[202]         MEP, size and shape descriptors               simplex                        rigid
De Rosa[203]        Euclidean distance in Hi-PCA space            -                              rigid
Handschuh[204]      geometric fit                                 GA and Quasi-Newton            flexible
3DFS[159, 205]      pharmacophore points                          GMA, Powell                    flexible
RigFit[206]         pharmacophore points (GFs)                    Quasi-Newton                   semiflexible
SQ[207]             SQ type                                       simplex                        semiflexible
Klebe[208]          pharmacophore points (GFs)                    Quasi-Newton                   semiflexible
MIPSIM[209]         MEP                                           gradient approach              semiflexible
Cosgrove[210]       local surface shape                           clique detection               rigid
MultiSEAL[211]      multiple                                      -                              flexible
TGSA[212, 213]      Topo-Geometrical                              -                              flexible
Labute[214]         atom properties                               modified RIPS                  flexible
SLATE[215]          distance matrix for H-bonding and aromatics   SA                             flexible
FLASHFLOOD[216]     comma descriptors                             cluster method                 flexible
AUTOFIT             pharmacophore points                          combinatorial search           flexible
QSSA[106]           ASA                                           GA and simplex                 semiflexible
fFlash[217]         pharmacophore points                          clique-based                   flexible
FIGO[182]           field interaction and geometric overlap       simplex                        rigid
FLUFF[218]          vdW and electrostatic                         -                              flexible
BRUTUS[219, 220]    charge distribution and vdW, grid-based       -                              rigid
FLAME[221]          MCP                                           GA and BFGS                    flexible
GMA[222]            MCP                                           gradient-based torsion space   flexible

MEP: Molecular Electrostatic Potential, PCA: Principal Component Analysis, vdW: van der Waals, GFs: Gaussian Functions were used, GA: Genetic Algorithm, RFO: Rational Function Optimization, RIPS: Random Incremental Pulse Search, BFGS: Broyden-Fletcher-Goldfarb-Shanno, SD: Steepest Descent, NR: Newton-Raphson, MCP: Maximum Common Pharmacophore, MC: Monte Carlo, ASA: Atomic Shell Approximation, SA: Surface Area

4.2 Implementation

The CuTieP approach can be divided into three key areas. The first is the generation of a set of conformers for the query structure that is to be aligned onto a stationary molecule called the target. Then each conformer is aligned onto the target structure, and finally the similarity between the two molecules is determined.

4.2.1 Ligand Conformational Searching

Conformational techniques to reproduce the bioactive conformation of small molecules can be divided into two categories: deterministic (or systematic) and stochastic. The former exhaustively enumerates all conformers by defining rotatable bonds and discrete torsional angles, as in the MIMUMBA program [223, 224]. The latter explores conformational space using molecular dynamics, genetic algorithm [225, 226], or Monte Carlo techniques. There are pros and cons for both categories; the systematic approach is certain to sample all conformational space, but the search space grows exponentially with the number of rotatable bonds. For large flexible molecules the stochastic approaches are therefore favored.

Various commercial packages for conformational searching are available, including SPE, Catalyst, Macromodel, Omega, MOE, and Rubicon, which were recently reviewed by Agrafiotis et al. [227], where SPE and Catalyst were more effective in sampling conformational space. The key point to note when performing conformational searching is the requirement of finding the bioactive conformation, though most often this does not correspond to the global energy minimum [223, 228–232]. And so for this study, which investigates the use of SE methods in molecular superposition, a systematic approach was chosen in order to ensure that the bioactive conformer was found within some tolerance.

4.2.2 Structural Alignment and Clique Detection

If two structures contain at least three pairs of reference points then they can be aligned onto one another by minimizing the sum of the squared distances between pairs as described in Ch. 3.9. This rigid body least squares procedure from Kearsley [173] and Kabsch [174, 175] generates a rotation matrix using quaternions and eigen methods. A clique detection algorithm described in Ch. 3.8 is employed to generate a set of correspondences between the two molecules in question, which has previously been shown to be an efficient technique for producing alignments [186, 221, 222]. Each set of reference points or clique produces trial alignments using the Kearsley algorithm. The clique detection algorithm uses a score function shown in Eq. 4–1, where $d^1_i$ is the distance between two pharmacophore features in molecule 1 and $d^2_i$ is the distance between the two equivalent features in molecule 2. The parameter $\Delta d_{max}$ controls how rapidly the match score drops off as the distances become less compatible. The features used in this clique detection algorithm include hydrogen bond acceptor/donor atoms, positive/negative charge centers, hydrophobic groups, rings, and ring normals.

$$\text{ClqScore} = \sum_{i}^{D} \exp\left[ -\left( \frac{d^1_i - d^2_i}{\Delta d_{max}} \right)^2 \right]$$ (4–1)
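The distance-compatibility test and the score of Eq. 4–1 can be sketched as follows. This is an illustration only: the feature labels and coordinates are invented, the function names are hypothetical, and using a fixed tolerance for the validity test is an assumption (the text only states that distances must be compatible within some tolerance).

```python
import math
from itertools import combinations

def clique_score(points1, points2, mapping, d_max=0.5):
    """Sum exp(-((d1 - d2) / d_max)^2) over all pairs of mapped features."""
    score = 0.0
    for (a1, a2), (b1, b2) in combinations(mapping, 2):
        d1 = math.dist(points1[a1], points1[b1])  # distance in molecule 1
        d2 = math.dist(points2[a2], points2[b2])  # equivalent distance in molecule 2
        score += math.exp(-((d1 - d2) / d_max) ** 2)
    return score

def is_valid_clique(points1, points2, mapping, tol=0.5):
    """A mapping is a clique if every inter-point distance pair agrees within tol."""
    for (a1, a2), (b1, b2) in combinations(mapping, 2):
        d1 = math.dist(points1[a1], points1[b1])
        d2 = math.dist(points2[a2], points2[b2])
        if abs(d1 - d2) > tol:
            return False
    return True

# Invented pharmacophore points for two molecules.
mol1 = {"A": (0, 0, 0), "B": (3, 0, 0), "C": (0, 4, 0), "F": (10, 0, 0)}
mol2 = {"A": (1, 1, 1), "B": (4, 1, 1), "C": (1, 5, 1), "F": (1, 1, 2)}

base = [("A", "A"), ("B", "B"), ("C", "C")]
extended = base + [("F", "F")]

base_ok = is_valid_clique(mol1, mol2, base)
extended_ok = is_valid_clique(mol1, mol2, extended)
base_score = clique_score(mol1, mol2, base)
```

With three mapped features there are three inter-point distances, so a perfect match scores 3.0. Adding a fourth mapping whose distances disagree makes the clique invalid, which mirrors the rejected mapping discussed for Fig. 3-23.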

4.2.3 Semiempirical Similarity Score

The trial alignments from the previous step are scored using a semiempirical function implemented in the QMALIGN program from Dixon and Merz [233]. The QMALIGN approach is dissimilar to all other quantum similarity and alignment programs: instead of aligning based on the density, ρ(r), as described by Carbó using Eq. 4–2, this program aligns structures based on their wavefunctions Ψ(r). The Carbó approach [106] matches the overall size and shape characteristics of molecules; however, there is no phase information in ρ(r), thus any overlap contributes positively to $Z_{AB}$ even though orbitals may be orthogonal. Alignment based on molecular wavefunctions, or more precisely frontier orbitals, takes into account both phase and orbital information. In this application of QMALIGN only the score function (Eq. 4–3) is used, where k and k′ are the mapped Molecular Orbitals (MOs) from each molecule and $\phi_k$ and $\epsilon_k$ are the MOs and energies. The parameter $\Delta\epsilon_{max}$ is defined similarly to $\Delta d_{max}$ above. The similarity between two molecules, A and B, was then calculated as a Carbó index, SEScore, as shown in Eq. 4–4.

$$Z_{AB} = \int \rho_A(r) \rho_B(r)\, dr$$ (4–2)

$$Z_{AB} = \sum_{k,k'} \exp\left[ -\left( \frac{\epsilon^A_k - \epsilon^B_{k'}}{\Delta\epsilon_{max}} \right) \right] \int \phi^A_k(r) \phi^B_{k'}(r)\, dr$$ (4–3)

$$\text{SEScore} = \frac{Z_{AB}}{\sqrt{Z_{AA} Z_{BB}}}$$ (4–4)

The complete CuTieP algorithm is outlined in pseudo code in Algorithm 4.1, where the user provides two molecules and a fragment library outlined in Appendix C. Pharmacophore points are assigned to each molecule using the substructure searching tool within MTK++ and the fragment library provided. A torsional based conformational search of the flexible molecule using the torsional resolutions outlined in Ch. 3.6 is carried out, where the lowest energy conformers are stored based on some energy cutoff, dConf. Then each of these conformers is aligned onto the template structure using the clique detection algorithm described above. The user defines the maximum number of cliques per conformer as nMaxCliques, and the total number of trial alignments from all conformer/clique combinations as nTotalMCP. If a target structure is available then a MM interaction energy is calculated between the trial alignments and the reference’s receptor structure with nMM stored, thus eliminating all unreasonable structures. This step is followed by determining the SE similarity score of the flexible and template molecules using the AM1 [73] Hamiltonian, with a total of nSE alignments being saved for later use.
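The normalization used for the Carbó index (Eq. 4–2 and Eq. 4–4) can be illustrated with a toy numerical example. The sketch below is not the QMALIGN score: it uses one-dimensional Gaussian model densities on a grid, a simple rectangle-rule integral, and hypothetical function names, purely to show how the index behaves.

```python
import math

def overlap(f, g, dx):
    """Rectangle-rule estimate of the overlap integral of two sampled functions."""
    return sum(a * b for a, b in zip(f, g)) * dx

def carbo_index(f, g, dx):
    """Z_AB / sqrt(Z_AA * Z_BB), the normalized overlap of Eq. 4-4."""
    return overlap(f, g, dx) / math.sqrt(overlap(f, f, dx) * overlap(g, g, dx))

def gaussian(x, center):
    return math.exp(-((x - center) ** 2))

dx = 0.05
grid = [i * dx - 10.0 for i in range(401)]         # samples from -10 to 10
rho_a = [gaussian(x, 0.0) for x in grid]            # model density of A
rho_b = [gaussian(x, 1.0) for x in grid]            # same shape, shifted by 1

same = carbo_index(rho_a, rho_a, dx)
shifted = carbo_index(rho_a, rho_b, dx)
```

Identical densities give an index of 1, and the index falls as the two densities are displaced from one another. Because the densities are non-negative everywhere, any overlap contributes positively, which is the limitation of density-based matching noted in the text.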

4.3 Results and Discussion

The goal of this study was to investigate the performance of the CuTieP alignment approach to reproduce crystallographic binding geometries of ligands in receptor active sites observed in the Protein Data Bank (PDB) [15].

4.3.1 Data Set

To evaluate the CuTieP alignment approach 84 crystal structures of protein ligand complexes were downloaded from the PDB as outlined in Table 4-2. This set contains 12 unique receptors including Carboxypeptidase A, Glycogen Phosphorylase, Immunoglobin, Streptavidin, Dihydrofolate Reductase (DHFR), Thrombin, Trypsin, Estrogen Receptor (ER), Peroxisome Proliferator-Activated Receptor γ (PPARγ), Human Carbonic

Anhydrase II (HCA II), Elastase, and Thermolysin. This data set resembles the one used to validate the FLEXS [234] program; however, Concanavalin, Endothiapepsin, HIV-Protease, Fructose Bisphosphatase, and Human Rhinovirus receptors were omitted due to ligand size and flexibility. The ER, PPARγ, and HCA II ligands are used here but were not in the FLEXS data set. The various ligands which bind certain receptors are labeled using the lowercase form of the PDB-ID corresponding to the complex structure. The data set was split into two where the ligands in the first portion were flexibly aligned whereas the remaining ligands were rigidly aligned.

Algorithm 4.1: Flexible Alignment of Drug-like Molecules
Data: Fragment Libraries, Template and Flexible Files
Result: Flexible Molecule/s Aligned onto Template Molecule
begin
    dConf ← 20−30 kcal/mol;
    nMaxCliques ← 10;
    nTotalMCP ← 10000;
    nMM ← 1000;
    nSE ← 100;
    ∆dmax ← 0.5;
    Read Fragment Libraries;
    temp ← Read Template Molecule;
    flex ← Read Flexible Molecule;
    findFunctionalGroups(temp);
    findFunctionalGroups(flex);
    assignPharmacophorePoints(temp);
    assignPharmacophorePoints(flex);
    totalConformers ← confSampler(flex);
    nConformers ← confAnalysis(flex, dConf);
    for i → nConformers do
        nCliques_i ← MCP(Template, Conformer_i, nMaxCliques);
        for j → nCliques_i do
            rotMat ← Superimpose(Pharmacophore);
            alignedConformer_ij ← rotMat ∗ Conformer_i;
        end
    end
    store best nTotalMCP alignments;
    for i → nTotalMCP do
        Calculate E^MM_INT,i(receptor, alignedConformer_i);
    end
    store best nMM alignments;
    for i → nMM do
        Calculate SE^qmalign_i(temp, alignedConformer_i);
    end
    finalAlignments ← CuTieP();
end
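The three "store best" steps of Algorithm 4.1 form a filtering funnel, which can be sketched generically in Python. The sketch assumes that higher clique and SE scores are better and that lower MM interaction energies are better; the function names and the toy pose tuples are hypothetical and stand in for real trial alignments.

```python
import heapq

def keep_best(items, key, n, highest=True):
    """Return the n items with the best value of key (highest or lowest)."""
    pick = heapq.nlargest if highest else heapq.nsmallest
    return pick(n, items, key=key)

def funnel(poses, clique_score, mm_energy, se_score,
           n_total_mcp=10000, n_mm=1000, n_se=100):
    """Rank by clique score, then by MM interaction energy, then by SE score."""
    by_clique = keep_best(poses, clique_score, n_total_mcp, highest=True)
    by_mm = keep_best(by_clique, mm_energy, n_mm, highest=False)
    return keep_best(by_mm, se_score, n_se, highest=True)

# Toy poses: (id, clique score, MM energy, SE score), values invented.
poses = [(i, float(i), float((i - 7) ** 2), float(i % 4)) for i in range(10)]

final = funnel(
    poses,
    clique_score=lambda p: p[1],
    mm_energy=lambda p: p[2],
    se_score=lambda p: p[3],
    n_total_mcp=6, n_mm=3, n_se=1,
)
```

Each stage only sees the survivors of the previous one, so the cheaper geometric and MM filters limit how many of the more expensive semiempirical evaluations are needed.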

Table 4-2. Protein-Ligand Data Set.

Receptor                        PDB-ID

Flexible Alignment
Carboxypeptidase A              1CBX, 2CTC, 3CPA
Glycogen Phosphorylase          1GPY, 3GPB, 4GPB, 5GPB
Immunoglobin                    1DBB, 1DBJ, 1DBK, 1DBM, 2DBL
Streptavidin                    1SRE, 1SRF, 1SRG, 1SRH, 1SRI, 1SRJ
Dihydrofolate Reductase         1DHF, 4DFR
Trypsin                         1PPH, 1TNH, 1TNI, 1TNJ, 1TNK, 1TNL, 3PTB
Estrogen Receptor               1ERR, 3ERT
Peroxisome Proliferator-        1FM6, 1FM9
  Activated Receptor γ
Human Carbonic Anhydrase II     1A42, 1BN1, 1BN3, 1BN4, 1BNM, 1BNN, 1BNQ, 1BNT, 1BNU, 1BNV,
                                1BNW, 1CIL, 1CIM, 1CIN, 1CNX, 1EOU, 1G1D, 1G52, 1G53, 1G54,
                                1I8Z, 1I90, 1I91, 1IF4, 1IF5, 1IF6, 1IF7, 1IF8, 1IF9, 1KWQ,
                                1KWR, 1OKL, 1OKN, 1OQ5, 1TTM, 1XPZ, 1XQ0, 1YDA, 1ZE8

Rigid Alignment
Thrombin                        1DWC, 1DWD
Elastase                        1ELA, 1ELB, 1ELC, 1ELD, 1ELE
Thermolysin                     1TLP, 1TMN, 2TMN, 3TMN, 4TLN, 4TMN, 5TMN

Each complex structure was broken into two, a receptor and a ligand, with all co-factors and water molecules removed. The ligand’s relative orientation in space was retained. The Labute algorithm [142] within MTK++ was used to perceive atom hybridizations, bond orders, and formal charges of the ligands. Hydrogen atoms were added to both the receptor and ligand structures using the protonate functions also built into MTK++. GAFF [160] atom types and CM2 [123] charges were assigned to the ligands using the antechamber [166] and DivCon [112] programs. The locations of the rotatable bonds and the functional groups of each ligand were determined using the tools within MTK++ as described in Ch. 3. Each functional group within the MTK++ package contains pharmacophoric information required for the clique detection algorithm above to generate trial alignments. The atom and bond types, charges, and feature information are stored in xml format for use with the CuTieP program.

Determining the performance of an alignment approach requires the definition of a reference state. For each receptor, the structures therein were taken in turn and all other complexes were superimposed onto each by minimizing the RMSD between the peptide backbone atoms. Thus for each receptor a set of ligands was aligned into its active site, which describes their reference state or ideal alignment.

Each pair of ligands in all receptor classes was superimposed onto one another, except for HCA II where 1A42 was only used as a target. After each alignment the root mean squared deviation (RMSD) between the query structure and the reference state was calculated. Within the calculation of the RMSD an atom type correspondence between atoms of the aligned query and ideal structures was determined. This procedure prevented artificially high RMSD values due to automorphisms as described in Ch. 3.9. This procedure results in an NxN matrix of alignments, where N is the number of complexes for each receptor class. These matrices are not symmetric, as the results depend on which ligand is used as the target. The lowest RMSD value from the top ten alignments is reported in the tables and figures below. Also shown in the output tables are the number of conformers sampled, the number stored, and the RMSD from the bioactive conformation. The ClqScore and SEScore values are also plotted as levelplots using the R program [235] to graphically view the alignment results.

The overall performance of CuTieP in reproducing binding geometries of ligands in complex with a receptor is shown in Table 4-3. An alignment is considered correct when the RMSD value between the pose and the reference structure is below 1.5 Å [196], while an RMSD under 2.5 Å represents the correct orientation and conformer of the query, and finally poses with an RMSD above 2.5 Å are considered to be misaligned. A total of 219 alignments were carried out and in 48.9% of the cases a satisfactory result was achieved. 64.8% of the alignments were in the correct orientation; however, 35.2% would be regarded as misaligned. Each receptor is outlined in more detail below.

Table 4-3. Statistics of CuTieP Performance.

Receptor                       N < 1.5 Å   N < 2.5 Å   N < 3.5 Å   Total
Carboxypeptidase A             2           3           6           6
Glycogen Phosphorylase         7           8           11          12
Immunoglobin                   7           13          15          20
Streptavidin                   25          27          30          30
Dihydrofolate Reductase        0           1           1           2
Trypsin                        12          20          28          42
Estrogen Receptor              1           1           2           2
PPARγ                          0           0           2           2
Human Carbonic Anhydrase II    22          33          37          39
Thrombin                       2           2           2           2
Elastase                       8           8           11          20
Thermolysin                    21          26          31          42
Total                          107         142         174         219
%                              48.9        64.8        79.5
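The percentages quoted in the text follow directly from the totals row of Table 4-3, as the short sketch below shows. The classification thresholds are those stated in the text; the function name `classify` is an assumption made for illustration.

```python
def classify(rmsd: float) -> str:
    """Label a pose using the RMSD thresholds (in Angstrom) given in the text."""
    if rmsd < 1.5:
        return "correct alignment"
    if rmsd < 2.5:
        return "correct orientation"
    return "misaligned"

# Totals row of Table 4-3.
totals = {"< 1.5": 107, "< 2.5": 142, "< 3.5": 174}
n_alignments = 219

percent = {k: round(100.0 * v / n_alignments, 1) for k, v in totals.items()}
misaligned_percent = round(100.0 * (n_alignments - totals["< 2.5"]) / n_alignments, 1)

# Three RMSD values taken from Table 4-4.
labels = [classify(x) for x in (1.01, 2.21, 2.74)]
```

The misaligned fraction is simply the complement of the alignments that fall under 2.5 Å.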

4.3.2 Carboxypeptidase A

Three Carboxypeptidase A ligand complexes were used in this study: L-benzyl-succinate (1cbx, Fig. 4-1(a)), L-phenyl-lactate (2ctc, Fig. 4-1(b)), and glycyl-L-tyrosine (3cpa, Fig. 4-1(c)). The ligand structures of 2ctc and 3cpa when 2CTC and 3CPA are aligned onto 1CBX are shown in Fig. 4-1(d); a key point to note is that the phenyl moieties do not overlap. The carboxylates (closer to the phenyl group) of all three structures are aligned onto one another and form a strong intermolecular interaction with an Arginine residue of the protein, while the tail group binds a Zinc atom.

The conformational analysis and pair alignments are outlined in Table 4-4. In all three cases conformers were generated that resemble the bioactive conformation (0.5 Å RMSD from the bioactive conformation) [223]. The conformational searching of 1cbx is further described in figures 4-2(a), 4-2(b), and 4-2(c). The plot of MM energy versus RMSD from the bioactive conformation indicates that the bound geometry is not the global energy minimum using GAFF. The Euclidean Distance (ED) is the RMSD in torsional space between a conformer and the bioactive structure. The plot of ED versus MM energy shows the conformers are clustered between 10° and 25° ED, while the variation of the RMSD versus ED is shown in Fig. 4-2(c). The best RMSD of the top 10 poses between each query structure and its associated reference alignment is also outlined in Table 4-4. When 1cbx is used as the reference or query, poor alignments are generated using the SE scoring function, though when either 2ctc or 3cpa is used RMSDs below 1.5 Å are found. This is a disappointing result because the clique detection step generated trial alignments under 1.0 Å, as shown in the levelplot of the ClqScores for this receptor (Fig. 4-3(a)), but the SE score function did not score these the highest, as shown in Fig. 4-3(b). The most probable reason for the discrepancy is that the phenyl group is not the most important pharmacophore feature in the set. Since the SE scoring function is a frontier orbital approach, the greatest similarity between pairs of these molecules would be obtained where the phenyl groups overlap.
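The Euclidean Distance used in the conformer analysis above can be illustrated with a short sketch. It assumes that ED is the root-mean-square difference over matching rotatable-bond torsions, and that each difference is wrapped into the range −180° to 180° so that 350° and 10° differ by 20°. The wrapping and the function name are assumptions made for illustration, not details confirmed by the text.

```python
import math


def torsion_rmsd(torsions_a, torsions_b):
    """RMS difference in degrees between two equal-length lists of torsion angles."""
    if len(torsions_a) != len(torsions_b) or not torsions_a:
        raise ValueError("torsion lists must be non-empty and of equal length")
    total = 0.0
    for a, b in zip(torsions_a, torsions_b):
        # Shortest signed angular difference, in [-180, 180).
        diff = (a - b + 180.0) % 360.0 - 180.0
        total += diff * diff
    return math.sqrt(total / len(torsions_a))


# Hypothetical conformer versus bioactive torsions (degrees).
ed = torsion_rmsd([350.0, 60.0], [10.0, 60.0])
print(round(ed, 3))
```

A conformer with identical torsions gives an ED of zero, and the value grows with the angular deviation of each rotatable bond.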

Table 4-4. Carboxypeptidase A Ligand Alignments.

Query                 1cbx    2ctc    3cpa
Rotatable Bonds          5       3       6
Conformers Sampled    1944     108   62208
Conformers Stored      126      70    6767
Conformers < 1 Å        11      30     807
Conformers < 2 Å       115      40    3201
Target 1cbx           0.00    2.21    2.74
Target 2ctc           2.38    0.00    1.01
Target 3cpa           2.05    1.24    0.00

4.3.3 Glycogen Phosphorylase

The ligands alpha-D-glucose-6-phosphate (1gpy, Fig. 4-4(a)), alpha-D-glucose-1-phosphate (3gpb, Fig. 4-4(b)), 2-fluoro-2-deoxy-alpha-D-glucose-1-phosphate (4gpb, Fig. 4-4(c)), and alpha-D-glucose-1-methylene-phosphate (5gpb, Fig. 4-4(d)) which bind to

Glycogen Phosphorylase were used in this study. The conformational analysis and SE pair alignment scores are outlined in Table 4-5. Meaningful alignments when 1gpy is used as a target or query cannot be generated as this ligand binds in a different pocket as shown in Fig. 4-4(e); however, these alignments were carried out to generate a sense of the

Figure 4-1. Carboxypeptidase A Ligands. (a) 1cbx (b) 2ctc (c) 3cpa (d) 2CTC and 3CPA aligned onto 1CBX

predicted error in this approach. All other pair alignments produced excellent agreement using both the ClqScore and SEScore with the observed structural superimpositions from

the PDB as highlighted in figures 4-5(a) and 4-5(b).

4.3.4 Immunoglobin

The ligands progesterone (1dbb, Fig. 4-6(a)), aetiocholanolone (1dbj, Fig. 4-6(b)),

5-beta-androstane-3,17-dione (1dbk, Fig. 4-6(c)), progesterone-11-alpha-ol-hemisuccinate (1dbm, Fig. 4-6(d)), and 5-alpha-pregnane-3-beta-ol-hemisuccinate (2dbl, Fig. 4-6(e)) which bind to Immunoglobin were used in this study. The ligands 1dbj and 1dbk are cholic acid derivatives while 1dbm, 2dbl, and 1dbb are steroidal molecules. The ideal

Figure 4-2. 1CBX Conformer Analysis. (a) MM Energy (kcal/mol) vs RMSD (Å) (b) MM Energy (kcal/mol) vs Euclidean Distance (Deg) (c) Euclidean Distance (Deg) vs RMSD (Å)

Table 4-5. Glycogen Phosphorylase Ligand Alignments.

Query                 1gpy    3gpb    4gpb    5gpb
Rotatable Bonds          3       3       3       3
Conformers Sampled      27      27      27      27
Conformers Stored       16      16      19      14
Conformers < 1 Å        10      12      14      10
Conformers < 2 Å         6       4       5       4
Target 1gpy           0.00    3.66    3.19    3.92
Target 3gpb           1.43    0.00    0.38    0.84
Target 4gpb           1.52    0.42    0.00    0.64
Target 5gpb           3.18    0.82    0.52    0.00

Figure 4-3. Carboxypeptidase A Alignment Results. (a) ClqScore (b) SEScore. Reference or target structures are on the x-axis with the query structures on the y-axis. The best RMSD value of each pose alignment compared to the ideal alignment is given with an RMSD scale in Å.

alignments of 1dbj, 1dbk, 1dbm, and 2dbl for the 1dbb target structure are shown in Fig. 4-6(f). No ring flexibility was allowed in this study and so only the bioactive conformation

of 1dbj and 1dbk was used, as indicated in Table 4-6. This receptor produced the largest difference between the top poses predicted by ClqScore and SEScore, as shown in Fig. 4-7(a) and 4-7(b) respectively. The alignment of 1dbm and 2dbl onto 1dbj and 1dbk produces large RMSD values using the SEScore even though the clique detection algorithm produces an alignment which visually appears more reasonable. In fact the top poses are aligned perpendicular to the plane of the target structures, which may be due to the differences in the stereochemistry of the two subsets. The alignments of ligands within the same subset produce poses that are in good agreement with the reference states.

4.3.5 Streptavidin

Streptavidin ligands including 2-((4’-hydroxyphenyl)-azo) benzoic acid (1sre, Fig. 4-8(a)), 2-((3’-tertbutyl-4’-hydroxyphenyl)azo) benzoic acid (1srf, Fig. 4-8(b)), 2-((3’-methyl-4’-hydroxyphenyl)azo) benzoic acid (1srg, Fig. 4-8(c)), 2-((3’,5’-dimethoxy-4’-hydroxyphenyl)azo) benzoic acid (1srh, Fig. 4-8(d)), 2-((3’,5’-dimethyl-4’-hydroxyphenyl)azo)

Figure 4-4. Glycogen Phosphorylase Ligands. (a) 1gpy (b) 3gpb (c) 4gpb (d) 5gpb (e) 3GPB, 4GPB and 5GPB aligned onto 1GPY

benzoic acid (1sri, Fig. 4-8(e)), and 2-((4’-hydroxynaphthyl)-azo) benzoic acid (1srj, Fig. 4-8(f)) were used in the validation of CuTieP. All ligands contain the azo-benzoic acid core and vary at the iminophenol group as shown in Fig. 4-8(g). All ideal alignments were

predicted with good agreement using both the ClqScore and the SEScore except for the query 1srf and target 1sri, as shown in Table 4-7 and figures 4-9(a) and 4-9(b). This high RMSD can be attributed to the fact that there are two valid alignments available in a

Figure 4-5. Glycogen Phosphorylase Alignment Results. (a) ClqScore (b) SEScore. Reference or target structures are on the x-axis with the query structures on the y-axis. The best RMSD value of each pose alignment compared to the ideal alignment is given with an RMSD scale in Å.

Table 4-6. Immunoglobin Ligand Alignments

Query                 1dbb    1dbj    1dbk    1dbm    2dbl
Rotatable Bonds          1       0       0       6       6
Conformers Sampled      12       1       1   93312   93312
Conformers Stored        9       1       1    5000    5000
Conformers < 1 Å         9       1       1     712    1361
Conformers < 2 Å         0       0       0    4288    3628
Target 1dbb           0.00    3.65    1.85    0.77    1.30
Target 1dbj           3.24    0.00    0.29    6.41    4.48
Target 1dbk           3.20    0.29    0.00    6.28    4.45
Target 1dbm           0.41    1.59    1.59    0.00    0.86
Target 2dbl           0.43    1.65    1.69    1.52    0.00

LBDD sense; the t-butyl group of 1srf may be placed on to either methyl moiety of 1sri.

4.3.6 Dihydrofolate Reductase

The ligand molecules dihydrofolate (1dhf, Fig. 4-10(a)) and methotrexate (4dfr,

Fig. 4-10(b)) which bind to Dihydrofolate Reductase were used to validate the CuTieP approach, with the alignment of 4DFR onto 1DHF shown in Fig. 4-10(c). Both ligand structures are quite large and have many rotatable bonds, and thus a large number of

Figure 4-6. Immunoglobin Ligands. (a) 1dbb (b) 1dbj (c) 1dbk (d) 1dbm (e) 2dbl (f) 1DBJ, 1DBK, 1DBM and 2DBL aligned onto 1DBB

Figure 4-7. Immunoglobin Alignment Results. (a) ClqScore (b) SEScore. Reference or target structures are on the x-axis with the query structures on the y-axis. The best RMSD value of each pose alignment compared to the ideal alignment is given with an RMSD scale in Å.

Table 4-7. Streptavidin Ligand Alignments

Query                 1sre    1srf    1srg    1srh    1sri    1srj
Rotatable Bonds          3       4       3       5       3       3
Conformers Sampled     108    1296     108  155552     108     108
Conformers Stored       55     117      55    2149      55      28
Conformers < 1 Å        36      49      28     661      26      28
Conformers < 2 Å        19      36      27     771      26       0
Target 1sre           0.00    2.50    0.27    0.85    1.02    0.44
Target 1srf           0.66    0.00    0.48    0.57    1.01    0.50
Target 1srg           0.27    2.59    0.00    0.52    0.76    0.39
Target 1srh           0.46    2.75    0.36    0.00    0.72    0.86
Target 1sri           0.90    3.01    0.75    0.77    0.00    1.08
Target 1srj           0.46    2.30    0.40    1.07    1.11    0.00

conformers were sampled as outlined in Table 4-8. Both the clique detection algorithm and the semiempirical scoring function produced poor results for this receptor compared to the results of the FLEXS program, where an average RMSD of 1.53 Å was predicted.

4.3.7 Trypsin

The following ligands of Trypsin: 4-fluorobenzylamine (1tnh, Fig. 4-11(a)), 4-phenylbutylamine (1tni, Fig. 4-11(b)), 2-phenylethylamine (1tnj, Fig. 4-11(c)),

Figure 4-8. Streptavidin Ligands. (a) 1sre (b) 1srf (c) 1srg (d) 1srh (e) 1sri (f) 1srj (g) 1SRF, 1SRG, 1SRH, 1SRI and 1SRJ aligned onto 1SRE

Figure 4-9. Streptavidin Alignment Results. (a) ClqScore (b) SEScore. Reference or target structures are on the x-axis with the query structures on the y-axis. The best RMSD value of each pose alignment compared to the ideal alignment is given with an RMSD scale in Å.

Table 4-8. Dihydrofolate Reductase Ligand Alignments. ClqScores are shown in parentheses.

Query                       1dhf          4dfr
Rotatable Bonds               10            10
Conformers Sampled        629856        629856
Conformers Stored           5000          5000
Conformers < 1 Å               7             5
Conformers < 2 Å             796           550
Target 1dhf                 0.00    3.28(3.29)
Target 4dfr           2.43(2.10)          0.00

3-phenylpropylamine (1tnk, Fig. 4-11(d)), trans-2-phenylcyclopropylamine (1tnl, Fig. 4-11(e)), benzamidine (3ptb, Fig. 4-11(f)), and m-amidino-nalpha-tosylated piperidide (1pph, Fig. 4-11(g)) were used in this study. The primary difference between 1tnh, 1tni, 1tnj, 1tnk, and 1tnl lies in the distance between the primary amine and the phenyl group, while 3ptb is a substructure of the largest molecule in this set, 1pph. There are some notable differences between the best poses predicted by the two scoring functions. When 1tnh is used as the target structure the SEScore predicts more reasonable poses than the clique detection algorithm; the alignment of 1tnh onto 3ptb is also predicted in closer

agreement with the reference state. The alignment of 1pph onto all other targets results in poor predicted alignments due to its size.

Figure 4-10. Dihydrofolate Reductase Ligands. (a) 1dhf (b) 4dfr (c) 4DFR aligned onto 1DHF

4.3.8 Estrogen Receptor

Two Estrogen Receptor (ER) complexes were used including 1ERR and 3ERT where raloxifene (1err, Fig. 4-13(a)) and 4-hydroxytamoxifen (3ert, Fig. 4-13(b)) bind respectively. The alignment of the 3ERT complex onto the 1ERR structure is shown in Fig. 4-13(c). The ER set was used in this study because it is a key target for breast cancer

Figure 4-11. Trypsin Inhibitors. (a) 1tnh (b) 1tni (c) 1tnj (d) 1tnk (e) 1tnl (f) 3ptb (g) 1pph (h) 1TNH, 1TNI, 1TNJ, 1TNK, 1TNL, 3PTB Aligned onto 1PPH

Table 4-9. Trypsin Ligand Alignments

Query                   1pph    1tnh    1tni    1tnj    1tnk    1tnl    3ptb
Rotatable Bonds            8       1       4       2       3       1       1
Conformers Sampled    559872      12     162      18      54       6       3
Conformers Stored       1000      12      90      17      36       6       4
Conformers < 1 Å           2      12      34      14      26       6       4
Conformers < 2 Å          62       0      56       3      10       0       0
Target 1pph             0.00    0.80    5.43    3.93    9.05    4.06    0.23
Target 1tnh             1.16    0.00    2.79    1.54    1.73    1.09    0.33
Target 1tni             6.75    3.50    0.00    2.41    2.79    2.46    3.38
Target 1tnj             6.82    2.26    2.84    0.00    1.97    0.62    0.94
Target 1tnk             7.14    3.39    2.90    2.64    0.00    0.30    1.56
Target 1tnl             7.36    0.62    3.00    1.40    1.17    0.00    4.20
Target 3ptb             3.81    0.70    2.98    2.13    1.32    4.70    0.00

Figure 4-12. Trypsin Alignment Results. (a) ClqScore (b) SEScore. Reference or target structures are on the x-axis with the query structures on the y-axis. The best RMSD value of each pose alignment compared to the ideal alignment is given with an RMSD scale in Å.

Figure 4-13. Estrogen Receptor Ligands. (a) 1err (b) 3ert (c) 3ERT Aligned onto 1ERR

drug discovery [31]. SEScore outperforms ClqScore for this receptor as shown in Table 4-10. The alignment of 3ert onto 1err is predicted with an RMSD value of 1.12 Å using SEScore, but 1err onto 3ert can be considered misaligned.

4.3.9 Peroxisome Proliferator-Activated Receptor γ

The rosiglitazone (1fm6, Fig. 4-14(a)) and GI262570 (1fm9, Fig. 4-14(b)) ligands of Peroxisome Proliferator-Activated Receptor γ (PPARγ) were used in this study with the ideal structures shown in Fig. 4-14(c). This receptor was included because it has been directly linked to type 2 diabetes, cardiovascular diseases, and obesity [236–240] and so

is a key target for drug research. The ClqScore predicts the correct pose for 1fm6 aligned onto 1fm9 but the opposite alignment gave poor results. SEScore was on average better than the ClqScore, but both alignments could be considered misaligned.

Table 4-10. Estrogen Receptor Ligand Alignments. ClqScores are shown in parentheses.

Query                       1err          3ert
Rotatable Bonds                6             8
Conformers Sampled         11664        419904
Conformers Stored            567          1015
Conformers < 1 Å              16           132
Conformers < 2 Å             238           881
Target 1err                 0.00    1.12(2.15)
Target 3ert           2.82(3.47)          0.00

Table 4-11. PPARγ Ligand Alignments.

Query                       1fm6          1fm9
Rotatable Bonds                4             8
Conformers Sampled           324        419904
Conformers Stored            271          2000
Conformers < 1 Å              11             3
Conformers < 2 Å             101            88
Target 1fm6                 0.00    2.69(5.09)
Target 1fm9           2.59(1.39)          0.00

4.3.10 Human Carbonic Anhydrase II

The 40 Human Carbonic Anhydrase II (HCA II) ligands used in this study are shown in Table 4-12. HCA II is a Zinc metalloprotein as described in Ch. 2.14 and all inhibitors bind the Zn ion through the sulfonamide group. The HCA family of proteins has been extensively studied using X-ray crystallography because its members are important targets for drug discovery, with drugs utilized as antiglaucoma, anticonvulsant, antiurolithic, antiepileptic, and anticancer agents [241]. This set was also chosen because it represents a data set which could be used within the SE-COMBINE approach to predict binding free energies and design new drugs. The ideal states for the 39 ligands aligned onto the 1a42 inhibitor are shown in Fig. 4-15. In total 22 ligands were aligned in a satisfactory manner, 11 were

aligned with the correct orientation, while only 6 ligands can be regarded as misaligned, whereas the ClqScore misaligned 12 structures as shown in Table 4-13.

Figure 4-14. Peroxisome Proliferator-Activated Receptor γ Agonists. (a) 1fm6 (b) 1fm9 (c) 1FM6 Aligned onto 1FM9

Table 4-12. 40 Human Carbonic Anhydrase II Inhibitors. (Structure drawings omitted.)

PDB ID   Ref      PDB ID   Ref
1a42     [25]     1i8z     [242]
1bn1     [25]     1i90     [242]
1bn3     [25]     1i91     [242]
1bn4     [25]     1if4
1bnm     [25]     1if5
1bnn     [25]     1if6
1bnq     [25]     1if7     [243]
1bnt     [25]     1if8     [243]
1bnu     [25]     1if9
1bnv     [25]     1kwq     [244]
1bnw     [25]     1kwr     [244]
1cil     [245]    1okl     [244]
1cim     [245]    1okm
1cin     [245]    1okn
1cnx     [244]    1oq5     [246]
1eou     [247]    1ttm     [248]
1g1d     [249]    1xpz     [248]
1g52     [249]    1xq0     [248]
1g53     [249]    1yda     [244]
1g54     [249]    1ze8     [250]

4.3.11 Thrombin

Two inhibitors of Thrombin, 1dwe (Fig. 4-16(a)) and 1dwd (Fig. 4-16(b)), were used, with 1DWE aligned onto 1DWD shown in Fig. 4-16(c). The ClqScore provides the same results as the SEScore with an average RMSD of 1.12 Å; however, no conformational searching was performed.

4.3.12 Elastase

Five tripeptide ligands of Elastase were used in this study including 1ela (Fig. 4-17(a)), 1elb (Fig. 4-17(b)), 1elc (Fig. 4-17(c)), 1eld (Fig. 4-17(d)), and 1ele (Fig. 4-17(e)) where 1ela, 1eld, and 1ele occupy different pockets of the serine protease. The

Figure 4-15. HCA II Ligands Aligned onto the 1A42 Structure. (a) 1BN3, 1BNM, 1BNN, 1BNQ, 1BNT, 1BNU, 1BNV, 1I8Z, 1I90, 1I91, 1CIL, 1CIM, and 1CIN (b) 1BN1, 1BN4, 1BNW, 1IF4, 1IF5, 1IF6, 1KWQ, 1KWR, 1OKL, and 1YDA (c) 1G1D, 1G52, 1G53, 1G54, 1IF7, 1IF8, 1IF9, 1CNX, 1OKM, 1OKN, and 1ZE8 (d) 1EOU, 1OQ5, 1TTM, 1XPZ, and 1XQ0

Table 4-13. Human Carbonic Anhydrase II Results.

        Rotatable   Conformers   Conformers   Conformers   Conformers
Ligand  Bonds       Sampled      Stored       < 1.0 Å      < 2.0 Å      ClqScore   SE
1a42        7          8748
1bn1        5         15552        3545           68         3208         4.23    1.70
1bn3        3           864         133           60           73         0.94    0.50
1bn4        6        186624        5000          179         4818         5.87    3.70
1bnm        4          5184         702          309          393         0.89    0.53
1bnn        3          1726         458          188          270         2.65    1.90
1bnq        6          2916         259           80          179         0.10    0.11
1bnt        3          1728         481          331          150         1.09    0.29
1bnu        3           432          60            8           52         1.41    1.67
1bnv        4          5184         784          344          440         2.93    1.09
1bnw        5         15552        3775          194         2978         1.63    2.84
1cil        3           108          25           19            6         0.95    0.90
1cim        1            12           7            7            0         0.58    0.58
1cin        2            36           7            6            1         0.25    0.25
1cnx       11        944784        5000          119         4188         1.41    1.62
1eou        3            27          10            7            3         1.29    1.29
1g1d        5         20736        4311          427         3860         1.57    2.04
1g52        5         20736        4470          426         3766         1.98    3.39
1g53        5         20736        3542          323         3188         1.46    1.57
1g54        5         20736        4238          606         3290         2.72    1.57
1i8z        5         15552         803          103          438         3.44    1.10
1i90        5           972          79           35           44         1.09    1.05
1i91        4          1296          64           17           47         1.37    0.79
1if4        3           432           7            7            0         0.27    0.24
1if5        1            12           7            7            0         0.41    0.34
1if6        3           432           7            7            0         0.52    0.39
1if7        7        186624        5000          107         3279         1.55    1.45
1if8        7        186624        5000          221         3243         1.61    2.04
1if9        7         93312        5000            2         3503         3.97    2.82
1kwq        3           432           8            2            6         2.47    1.22
1kwr        1            12           7            7            0         1.03    1.07
1okl        2            72           4            1            3         1.21    1.25
1okm        7         23328        5000          518         4482         1.25    0.75
1okn        9        104976        5000          196         4696         1.91    2.27
1oq5        2           144          73           66            7         4.45    1.77
1ttm        2            36          37           31            6         1.57    1.23
1xpz        7        186624        5000            9         2657         4.14    3.78
1xq0        7        186624        5000            3         2019         3.79    2.82
1yda        2            72          37           17           20         2.03    1.63
1ze8        4          1296         480          208          127         2.77    1.69

Figure 4-16. Thrombin Inhibitors. (a) 1dwe (b) 1dwd (c) 1DWE Aligned onto 1DWD

Table 4-14. Thrombin Ligand Alignments

Query           1dwc    1dwd
Target 1dwc     0.00    1.09
Target 1dwd     1.15    0.00

alignment of 1ELB, 1ELC, 1ELD, and 1ELE onto 1ELA is shown in Fig. 4-17(f). Only the bioactive conformation of each ligand was used in this section of the validation of the CuTieP approach. All pair alignments were evaluated; however, no reasonable superposition of 1elb and 1elc onto 1ela, 1eld, or 1ele can be expected, and the results in Table 4-15 and figures 4-18(a) and 4-18(b) confirm this. The program FLEXS was also unable to align these pairs successfully; its authors explain that the volume overlap is below 60% [234]. All other pairs were successfully aligned using both the ClqScore and SEScore.

Table 4-15. Elastase Ligand Alignments.

Query           1ela    1elb    1elc    1eld    1ele
Target 1ela     0.00    3.17    7.02    0.48    0.28
Target 1elb     3.02    0.00    1.38    3.86    3.29
Target 1elc     5.03    0.98    0.00    5.62    4.18
Target 1eld     0.50    3.94    5.87    0.00    0.48
Target 1ele     0.26    3.21    5.74    0.47    0.00

4.3.13 Thermolysin

Pairs of seven inhibitors of Thermolysin from the PDB were aligned onto one another in this study. These included 1tlp (Fig. 4-19(a)), 1tmn (Fig. 4-19(b)), 2tmn (Fig. 4-19(c)),

3tmn (Fig. 4-19(d)), 4tln (Fig. 4-19(e)), 4tmn (Fig. 4-19(f)), and 5tmn (Fig. 4-19(g)), which vary in size and charge considerably. Fig. 4-19(h) shows the ligand structures when 1TMN, 2TMN, 3TMN, 4TLN, 4TMN, 5TLN, and 5TMN are aligned onto 1TLP. The smaller ligands 2tmn, 3tmn, and 4tln were allowed to change conformation with the rest

kept in their bioactive conformation as outlined in Table 4-16. When the smaller ligands are used as the target structure the best poses generated with both the clique detection algorithm and the semiempirical score function are poor but this can be expected to occur.

Figure 4-17. Elastase Ligands. (a) 1ela (b) 1elb (c) 1elc (d) 1eld (e) 1ele (f) 1ELB, 1ELC, 1ELD and 1ELE aligned onto 1ELA

Figure 4-18. Elastase Alignment Results. (a) ClqScore (b) SEScore. Reference or target structures are on the x-axis with the query structures on the y-axis. The best RMSD value of each pose alignment compared to the ideal alignment is given with an RMSD scale in Å.

Table 4-16. Thermolysin Ligand Alignments.

Query                 1tlp    1tmn    2tmn    3tmn    4tln    4tmn    5tmn
Rotatable Bonds         12      14       5       7       4      16      16
Conformers Sampled       1       1     972  186624     216       1       1
Conformers Stored        1       1      68    5000      26       1       1
Conformers < 1 Å         1       1      12      25       4       1       1
Conformers < 2 Å         0       0      55    2533      22       0       0
Target 1tlp           0.00    0.81    1.35    2.48    3.68    1.98    1.41
Target 1tmn           0.82    0.00    4.85    3.11    1.72    0.79    1.03
Target 2tmn           0.93    0.79    0.00    3.03    1.80    0.53    0.51
Target 3tmn           1.20    0.40    2.67    0.00    4.13    0.76    0.99
Target 4tln           5.02    6.26    2.89    6.06    0.00    9.43    6.84
Target 4tmn           1.37    0.50    1.42    6.40    3.17    0.00    0.48
Target 5tmn           1.07    0.86    1.60    10.1    3.68    0.44    0.00

Figure 4-19. Thermolysin Inhibitors. (a) 1tlp (b) 1tmn (c) 2tmn (d) 3tmn (e) 4tln (f) 4tmn (g) 5tmn (h) 1TMN, 2TMN, 3TMN, 4TLN, 4TMN, 5TLN, 5TMN Aligned onto 1TLP

Figure 4-20. Thermolysin Alignment Results. (a) ClqScore (b) SEScore. Reference or target structures are on the x-axis with the query structures on the y-axis. The best RMSD value of each pose alignment compared to the ideal alignment is given with an RMSD scale in Å.

4.4 Conclusions

This study presents the first large scale validation of a semiflexible alignment approach using a semiempirical scoring function. Over 80 complexes and 219 unique alignments were considered, and the observed ligand binding geometry was predicted with 49% accuracy. Though the percentage of successful alignments using this method is not as high as that of FLEXS (60%), it provides an estimate of how physics-based techniques can perform against their empirical counterparts. Physics-based methods provide a more theoretically satisfying approach to molecular alignment. The only parameters used in the approach are those which appear in the SE Hamiltonian. No training set was used to fit a set of parameters, and so it is expected that this method would not fail where empirical approaches do because of limited transferability.

Speed is also an important property of molecular alignment algorithms. Empirical methods can often flexibly align molecular structures in times ranging from seconds to minutes. The current implementation of the CuTieP approach takes on the order of minutes to a few hours depending on the dConf, nMaxCLiques, nTotalMCP, and nMM parameters used. However, this method was designed for use with the SE-COMBINE approach, where an SE interaction energy evaluation would be much more expensive than the CuTieP alignment, and so speed is not as important as obtaining the correct pose. Numerous techniques may be employed to increase the efficiency of the CuTieP method, including tree pruning during the conformational searching or the use of large computer clusters or distributed computing, since the method is trivially parallelizable.
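The parallelism mentioned above follows from the fact that each target/query alignment is independent of the others. The sketch below enumerates the ordered pairs for one receptor class (order matters because the alignment matrix is not symmetric) and distributes them over a worker pool. The function names are hypothetical and `align_pair` is only a placeholder for a real alignment job. Threads are used purely to keep the example portable; a CPU-bound production run would more plausibly use separate processes or a cluster scheduler.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import permutations


def ordered_pairs(ligands):
    """All (target, query) pairs with target != query; order matters."""
    return list(permutations(ligands, 2))


def align_pair(pair):
    """Hypothetical placeholder for one independent alignment job."""
    target, query = pair
    # A real job would run the conformer search and scoring here
    # and return the RMSD of the best pose instead of None.
    return (target, query, None)


def run_all(ligands, workers=4):
    """Run every pairwise job; results come back in submission order."""
    jobs = ordered_pairs(ligands)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(align_pair, jobs))


results = run_all(["1cbx", "2ctc", "3cpa"])
print(len(results))
```

Three ligands give N(N − 1) = 6 jobs, which matches the Carboxypeptidase A total in Table 4-3.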

CHAPTER 5
METAL CLUSTER MOLECULAR MECHANICS PARAMETERIZATION

5.1 Introduction

There are currently 52,550 structures in the Protein Data Bank (PDB) [15] and searching for “metal” results in over 18,000 hits, with the breakdown shown in Table 5-1. Metal ions play a vital role in protein function, structure, and stability, with zinc, copper, and iron playing the biggest role as described in Ch. 2.14.

Table 5-1. Metal Ions in the Protein Data Bank (Accessed on April 5th 2007).

Metal   Hits     Metal   Hits     Metal   Hits
Na      2149     V         12     Pd         1
Mg      3467     Cr         7     Ag         9
K        632     Mn       984     Cd       361
Ca      3601     Fe      2022     Ir         6
                 Co       340     Pt        62
                 Ni       310     Au        28
                 Cu       589     Hg       323
                 Zn      3427     Total  18330

It is desirable to model metalloprotein systems using MM models because one can carry out simulations to address important structure/function and dynamics questions that are not currently attainable using QM and QM/MM based methods due to unavailability of parameters or system size. There are a number of approaches to incorporating metal ions into FFs.

The Bonded Model defines bonds, angles, and torsions between the metal ion and its ligands, which are added to the FF along with the van der Waals component of the non-bonded function. Hancock [251] used this approach to study systems including Copper and Nickel.

The Bonded plus Electrostatics Model defines bonds and angles between the metal ion and its ligands as well as electrostatic potential (ESP) charges (Fig. 5-1(a)) [252]. This method attempts to define the correct electrostatic representation of the metal active site, since assigning a +2 charge to a divalent metal ion, though formally correct, would not describe reality. The partial atomic charges can be calculated using the RESP approach [253] or the CMX models of Truhlar and Cramer [123]. The bond and angle force constants are derived from experiment or calculated using ab initio or DFT methods, while the torsion term has so far been neglected.

The Non-Bonded Model does not define any extra bonds and places an integer charge on the metal ion [254]. Electrostatic and Lennard-Jones terms describe the interactions. Modifications to this model to include polarization and charge transfer effects have been developed (Fig. 5-1(b)) [255].

The Cationic Dummy Atom Model is related to the non-bonded method; it places dummy atoms (cations) around the metal ion to mimic valence electrons [256]. Electrostatic and Lennard-Jones terms between the dummy atoms and ligating residues describe the metal-ion interactions (Fig. 5-1(c)) [257, 258].

Other methods include that of Vedani et al., which is a compromise between the bonded and non-bonded methods and is implemented in the YETI program [259], the SIBFA of Gresh and co-workers [260, 261], and the Universal Force Field (UFF) of Goddard and Rappe and co-workers [262–264]. These methods do not use a pairwise additive potential or are not readily available in typical biomolecular modeling packages.
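The difference between the bonded and non-bonded descriptions of a single metal-ligand contact can be made concrete with a small sketch. It assumes the AMBER-style harmonic bond term, E = K_r (r − r_eq)^2, for the bonded model, and a Coulomb plus 12-6 Lennard-Jones pair energy for the non-bonded model. All parameter values are hypothetical, chosen only for illustration, and the Coulomb conversion factor is the commonly used approximate value.

```python
# Approximate conversion factor so that charges in units of e and
# distances in Angstrom give energies in kcal/mol.
COULOMB_KCAL = 332.06


def harmonic_bond(r, r_eq, k_r):
    """Bonded model: harmonic metal-ligand bond energy, K_r (r - r_eq)^2."""
    return k_r * (r - r_eq) ** 2


def nonbonded_pair(r, q_i, q_j, epsilon, r_min):
    """Non-bonded model: Coulomb plus 12-6 Lennard-Jones energy for one pair."""
    coulomb = COULOMB_KCAL * q_i * q_j / r
    ratio6 = (r_min / r) ** 6
    lennard_jones = epsilon * (ratio6 * ratio6 - 2.0 * ratio6)
    return coulomb + lennard_jones


# Hypothetical metal-ligand distance of 2.1 Angstrom.
e_bonded = harmonic_bond(2.1, r_eq=2.0, k_r=100.0)
e_nonbonded = nonbonded_pair(2.1, q_i=2.0, q_j=-0.5, epsilon=0.1, r_min=2.2)
print(e_bonded, e_nonbonded)
```

In the bonded plus electrostatics model both kinds of term appear: the harmonic terms hold the coordination geometry, while fitted partial charges replace the formal integer charge used in the purely non-bonded description.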

Figure 5-1. Three Approaches to Incorporate Metal Atoms into Molecular Mechanics Force Fields. (a) Bonded Model (b) Non-Bonded Model (c) Cationic Dummy Atom Model. The bonded model defines bonds, angles, and dihedrals between the metal and ligands, while the non-bonded model does not and uses electrostatics and van der Waals to model the interactions. The cationic dummy atom model is a derivative of the non-bonded model where cations are placed near the metal center to mimic valence electrons around the metal.

Carrying out MM modeling or MD simulations of metal containing proteins is a complicated procedure using the bonded plus electrostatics model. Incorporating metals into protein force fields is a convoluted process due to the plethora of QM Hamiltonians, basis sets and charge models to choose from. It has also generally been carried out by hand, for specific metalloproteins, without extensive validation. Some of the published force fields for Zinc, Copper, Nickel, Iron, and Platinum containing systems using the bonded plus electrostatics model are listed in Table 5-2. Numerous other FFs containing various metals have been published, including ruthenium(II)-polypyridyl [265], cobalt corrinoids [266–269], Staphylococcal Nuclease [270], alcohol dehydrogenase [271–273], and metalloporphyrins [274–278].

Automated procedures for the parameterization of MM functions for inorganic coordination chemistry have been developed in recent years by Norrby and co-workers [279, 280]. Their efforts have focused on generating parameters from experimental structural data in databases such as the Cambridge Crystallographic Structural Database (CCSD) and from quantum mechanical reference data, using a version of the MM3 force field [71].

Table 5-2. Published Metalloprotein Force Fields Using the Bonded Plus Electrostatics Model.

Metal          Protein                        References
Zinc           Human Carbonic Anhydrase II    [252, 281, 282]
               Beta-lactamase                 [283–290]
               Dinuclear Beta-lactamase       [291, 292]
               Farnesyl Transferase           [293]
Copper         Blue Copper Proteins           [145–150]
Nickel         Urea Amidohydrolase            [151–154]
Iron           Cytochrome P450                [294, 295]
Platinum       DNA/Cisplatin                  [296]
Copper, Zinc   Superoxide Dismutase           [158]

5.2 Implementation

The goal of this research was to provide a platform to rapidly build, prototype, and validate MM models of metalloproteins using the bonded plus electrostatics model for the AMBER suite of programs [16]. The bonded plus electrostatics model was chosen over the other approaches because the resulting parameters can be readily added to FFs such as those in AMBER [58] and CHARMM [61]. Also, the functions used in these programs are pairwise additive, meaning there are no cross-terms, and are thus easier to parameterize and less computationally expensive. The latter is a key point considering that fully solvated metalloproteins in MD simulations can have many hundreds of thousands of atoms. A computer program, MCPB (Metal Center Parameter Builder), was developed to generate FF parameters for metalloproteins. MCPB was not built to supersede the approaches developed by Norrby described above but instead to incorporate a realistic bonded and electrostatic model of the metallocenter into the AMBER FF. The nature of these parameters was investigated in a systematic manner with the objective of creating a generalized metal FF within the bonded plus electrostatics framework.

The MCPB program was built using the MTK++ Application Program Interface (API) as described in chapter 3. A complete workflow of MCPB can be seen in Fig. 5-2. The MCPB program carries out the following steps after a structure is downloaded from the PDB. First the program checks whether the structure contains a transition metal. If the structure does not contain a metal then the program terminates. Otherwise MCPB attempts to determine the primary and secondary ligands of the metal using rules described by Harding [297–302], which will be described in more detail later in this chapter. Once a metal site is found, MCPB creates model structures of the metal’s first coordination sphere on which ab initio calculations can be performed to generate AMBER-like FF parameters. These models include one to generate charges, qi, and another to determine bond, Kr, and angle, Kθ, force constants. The AMBER function includes bond, angle, torsion, improper, van der Waals and electrostatic terms as described in chapter 2.3; however, only bond, angle and electrostatic terms are parameterized under the assumption made by Loops et al. that dihedral terms can be ignored. Lennard-Jones parameters are also not parameterized here because most metals are buried and van der Waals interactions are not as important as the electrostatics [280]. Lennard-Jones parameters for the most common metal ions in biology were taken from the

literature [303–310]. The methods of incorporating the bond, angle and charge parameters

are outlined below. Once a FF is produced it is tested using minimization techniques to observe its stability. Further tools, such as comparing the frequencies from ab initio calculations with those from the resulting FF, could also be used [311].

Figure 5-2. MCPB Flow Diagram where a biomolecular structure is downloaded from the PDB and tested whether it contains a transition metal. If the structure contains a metal ion the MCPB program is used to build and test molecular mechanics force field parameters. [Flow chart boxes: PDB; Metal Found; Models Setup; QM Calculations; Get qi, Kr, and Kθ; Test FF; End. Stages are followed by an "OK?" check with a "No" branch.]

5.2.1 Equilibrium Bond Lengths and Angles

Equilibrium values for bonds, req, and angles, θeq, can be determined through ab initio calculations or taken directly from the crystal structure in the PDB. Each method has pros and cons. Ab initio calculations are generally carried out in the gas phase; solvent effects can be incorporated with PCM, but at an added cost. Crystal structures may contain spurious values and may not be representative of all structures with this bond type. Both techniques of determining the equilibrium bond and angle values will be investigated later in this chapter.

5.2.2 Force Constants

Force constants, Kr and Kθ, are calculated by first creating a model (model 1) of the metal site, adding Hydrogen atoms using the methods described in Ch. 3.5 and then optimizing it in the gas phase. The residues bound to the metal are approximated, for example, cysteine by a thiolate or histidine by a methyl-, to reduce the computational cost of the minimization. However, all bonds and angles missing from the FF were accounted for. Once a minimum is found the second derivatives are determined. The Cartesian Hessian matrix, which is the second derivative of the energy with respect to the coordinates, is shown in Eq. 5–1. The eigen-analysis of [k] provides the force constants, λi, and the normal modes, ν̂i, as shown in Eq. 5–2. The interatomic force constant, KAB, between atoms A and B, which is required for a MM function, determines the force on atom A produced by displacing atom B as shown in Eq. 5–3.

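\[ [k] = k_{ij} = \frac{\partial^2 E}{\partial x_i \, \partial x_j} \tag{5–1} \]

\[ F_i = -[k]\,\hat{\nu}_i\,\delta r = -\lambda_i\,\hat{\nu}_i\,\delta r \tag{5–2} \]

\[ \delta F_A = [k_{AB}]\,\delta r_B \tag{5–3} \]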
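As an illustration of Eq. 5–3, the following minimal sketch extracts the 3×3 interatomic block [kAB] from a Cartesian Hessian and applies it to a displacement of atom B. The function names, the toy two-atom Hessian and its arbitrary units are assumptions made for this example and are not part of MCPB; the sign convention is taken from Eq. 5–3 as written.

```python
def interatomic_block(hessian, a, b):
    """Return the 3x3 block [k_AB] of a 3N x 3N Cartesian Hessian (atoms 0-indexed)."""
    return [row[3 * b:3 * b + 3] for row in hessian[3 * a:3 * a + 3]]


def force_change(k_ab, dr_b):
    """Apply Eq. 5-3 as printed: dF_A = [k_AB] dr_B."""
    return [sum(k * d for k, d in zip(row, dr_b)) for row in k_ab]


# Toy Hessian (arbitrary units): two atoms joined by a spring of stiffness 2.0 along x.
k = 2.0
toy_hessian = [[0.0] * 6 for _ in range(6)]
toy_hessian[0][0] = toy_hessian[3][3] = k
toy_hessian[0][3] = toy_hessian[3][0] = -k

k_ab = interatomic_block(toy_hessian, 0, 1)       # block coupling atom 0 to atom 1
d_force_a = force_change(k_ab, [0.5, 0.0, 0.0])   # displace atom 1 along x
```

In a real calculation the Hessian would come from the QM frequency job on model 1 rather than being built by hand.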

From the minimized structure of model 1 the metal-ligand bond and angle force constants are evaluated. The force constants are converted from Cartesian into internal coordinates using the Gaussian program [312] with the keyword iop(7/33=1). The MCPB program then reads the resulting internal force constant matrix and assigns the values to the appropriate bonds and angles using a conversion factor of 627.5095 between Hartree and kcal/mol and 2240.87 between Hartree/Bohr² and kcal/(mol·Å²) for bonds.

5.2.3 Point Charges

The atom centered partial charges were derived using the Merz-Singh-Kollman (MK) [126] and the Restrained ElectroStatic Potential (RESP) [127, 313, 314] schemes described in Ch. 2.9 using a second model (model 2) of the metal center. This model included all atoms of each bound residue, which were capped with acetyl (ACE) and N-methylamine (NME) residues. If two ligating residues were less than five residues apart then they were tethered with glycine residues and the chain capped with ACE and NME. Hydrogen atoms were added using the methods described in Ch. 3.5. This model was not allowed to relax, both to save computational expense and to keep the crystallographic geometry. The van der Waals radii for the metals used in the MK scheme were taken from the literature. The MK/RESP scheme was favored over other charge model schemes because of its ability to adjust the charge of the capped or linking residues to an integer value, thus allowing the formal charge of the cluster to disperse over the metal and the bound ligands.

5.3 Zinc AMBER Force Field

With the ability to build and validate metal FFs established, the task of generating a generalized FF was initiated. Zinc was chosen because a considerable number of proteins contain that metal, as highlighted in Table 5-1, and because it is computationally well behaved. Metalloproteins containing zinc are both structural and functional proteins as described in Ch. 2.14, and in general Zn is four coordinate, sometimes five or six coordinate when multiple ASP/GLU residues or water molecules bind.

It was then necessary to determine all Zn environments which exist in proteins. This was carried out using a program called pdbSearcher to analyze all structures currently in the PDB. pdbSearcher was developed using the API provided by MTK++ as described in Ch. 3. All X-ray crystal structures with a resolution below 3.0 Å were extracted from a local mirror of the PDB for further analysis. For each metal site the primary and secondary shell ligands were determined using Harding’s bond cut-off values as shown in Table 5-3 [297–302]. These values were determined from a series of papers describing metal coordination in the CCSD. A donor atom is considered primary coordinated to a metal if it is within the target distance shown in Table 5-3 plus some tolerance (0.5 Å was used). Donor atoms at metal-donor distances lying between the target distance plus the tolerance and the target distance plus a second tolerance (1.0 Å was used) were defined as secondary ligands. For example, if a Zn atom is less than 2.53 Å from a Histidine ND1 or NE2 atom then that Histidine is considered a primary ligand. If it is less than 3.03 Å away then the ligand is labeled as secondary; otherwise it is unbound. Once the number of primary and secondary shell ligands was determined, the geometry of the metal center was evaluated. The coordination states allowed include octahedral, Fig. 5-3(a), trigonal bipyramid, Fig. 5-3(b), square pyramid, Fig. 5-3(c), tetrahedral, Fig. 5-3(d), square planar, Fig. 5-3(e), and tetrahedral plus a non-bonded contact, Fig. 5-3(f). From Fig. 5-3 we can see that the coordination number alone is not enough to assign a metal geometry. Thus the root mean square deviation (RMSD) of the geometry angles from those in a regular polyhedron was calculated. Equation 5–4 was used to distinguish between square planar and tetrahedral geometries with the ideal angles listed in Table 5-4. Likewise, equations 5–5 and 5–6 were used for five and six coordinate metals respectively. The atom indices in Table 5-4 correspond to those atoms in Fig. 5-3. This indexing is useful to differentiate between axial/equatorial and cis/trans ligands. The coordination state with the lowest RMSD was assigned to the metal and its ligands.

Table 5-3. Metal-Donor Bond Target Lengths in Å. The following donor atoms of residues are implied: HOH@O, ASP@OD1/OD2, GLU@OE1/OE2, HIS@ND1/NE2, CYS@SG, MET@SG, SER@O, THR@O, TYR@O and the amino acid backbone carbonyl oxygen atom CRL. If a metal-donor distance is within these target distances plus some tolerance (0.5 Å) it is considered a primary interaction.

Metal   HOH    ASP/GLU   HIS    CYS/MET   SER/THR   TYR    CRL
Na      2.41   2.41      -      -         -         -      2.38
Mg      2.07   2.07      -      -         2.10      1.87   2.26
K       2.81   2.82      -      -         -         -      2.74
Ca      2.39   2.36      -      -         2.43      2.20   2.36
Mn      2.19   2.15      2.21   2.35      2.25      1.88   2.19
Fe      2.09   2.04      2.16   2.30      2.13      1.93   2.04
Co      2.09   2.05      2.14   2.25      2.10      1.90   2.08
Ni      2.09   2.05      2.14   2.25      2.10      1.90   2.08
Cu      2.13   1.99      2.02   2.15      2.00      1.90   2.04
Zn      2.09   1.99      2.03   2.31      2.14      1.95   2.07
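The primary/secondary shell rule described above can be sketched as follows, using the Zn row of Table 5-3 and the 0.5 Å and 1.0 Å tolerances quoted in the text. The dictionary layout and the function name are illustrative assumptions and do not reproduce the pdbSearcher or MTK++ interface.

```python
# Zn target distances (in Angstrom) from the Zn row of Table 5-3, keyed by residue name.
ZN_TARGETS = {
    "HOH": 2.09, "ASP": 1.99, "GLU": 1.99, "HIS": 2.03,
    "CYS": 2.31, "MET": 2.31, "SER": 2.14, "THR": 2.14,
    "TYR": 1.95, "CRL": 2.07,
}


def classify_contact(residue, distance, primary_tol=0.5, secondary_tol=1.0):
    """Label a metal-donor contact as primary, secondary or unbound."""
    target = ZN_TARGETS[residue]
    if distance < target + primary_tol:
        return "primary"
    if distance < target + secondary_tol:
        return "secondary"
    return "unbound"


# A histidine nitrogen 2.10 A from Zn falls inside the 2.53 A primary cut-off,
# one at 2.80 A falls inside the 3.03 A secondary cut-off.
labels = [classify_contact("HIS", d) for d in (2.10, 2.80, 3.50)]
```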

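\[ \delta_{tet/sqp} = \left[ \frac{1}{6} \sum_{i=1}^{6} (a_i - a_{ideal})^2 \right]^{1/2} \tag{5–4} \]

\[ \delta_{tbp/ttp} = \left[ \frac{1}{10} \sum_{i=1}^{10} (a_i - a_{ideal})^2 \right]^{1/2} \tag{5–5} \]

\[ \delta_{oct} = \left[ \frac{1}{15} \sum_{i=1}^{15} (a_i - a_{ideal})^2 \right]^{1/2} \tag{5–6} \]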
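A minimal sketch of how Eq. 5–4 separates tetrahedral from square planar sites is given below, using the ideal angles of Table 5-4. It assumes that the ligand indices of Fig. 5-3 have already been assigned, so that a12 and a34 are the trans pairs of the square planar reference; the names are illustrative only. The six S-Zn-S angles of the 1A5T cluster (Table 5-7) serve purely as sample input.

```python
import math

PAIRS = ["a12", "a13", "a14", "a23", "a24", "a34"]

# Ideal angles in degrees for the two four-coordinate geometries (Table 5-4).
IDEAL = {
    "tet": {p: 109.5 for p in PAIRS},
    "sqp": {p: (180.0 if p in ("a12", "a34") else 90.0) for p in PAIRS},
}


def angle_rmsd(angles, ideal):
    """Eq. 5-4: root mean square deviation of the six L-M-L angles."""
    return math.sqrt(sum((angles[p] - ideal[p]) ** 2 for p in PAIRS) / len(PAIRS))


def assign_four_coordinate(angles):
    """Return the geometry with the lowest RMSD together with all scores."""
    scores = {name: angle_rmsd(angles, ideal) for name, ideal in IDEAL.items()}
    return min(scores, key=scores.get), scores


zn_cccc = {"a12": 107.384, "a13": 111.062, "a14": 108.75,
           "a23": 109.664, "a24": 110.75, "a34": 109.22}
geometry, scores = assign_four_coordinate(zn_cccc)
```

The same pattern would extend to Eqs. 5–5 and 5–6 by supplying the ten or fifteen ideal angles of the five and six coordinate references.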

Figure 5-3. Metal Ligand Geometries Perceived Using Harding’s Rules. Panels: (a) Octahedral, (b) Trigonal Bipyramid, (c) Square Pyramid, (d) Tetrahedral, (e) Square Planar, (f) Tetrahedral plus Non-Bonded Contact.

5.3.1 Protein Data Bank Survey of Zinc Containing Proteins

The results of searching the PDB (accessed on the 5th of April 2007) for zinc metalloproteins using the rules above are shown in Fig. 5-4. There are 524 cases of trigonal bipyramidal (tbp), 706 cases of square pyramid (trp) and 228 instances of octahedral (oct) geometry. 615 metal centers are found as tetrahedral with a non-bonded contact (tnb) and 1372 are ill-defined using the current definitions (unk). 2964 out of 6435 total observations, or 46.1%, of zinc atoms in protein structures are found to be tetrahedral (tet), and thus the results and discussion will focus on them. The most common Zn coordinating

Table 5-4. Ideal Angles Used to Calculate Root Mean Square Deviations for Tetrahedral, Square Planar, Trigonal Bipyramidal, Square Pyramid and Octahedral Geometries. The notation a12 describes the angle between atom 1, the metal and atom 2. The atom indices correspond to Fig. 5-3. bm is the mean of the four angles between the apical bond and the basal bonds in square pyramid geometries.

Type   Coordination            Angle (Deg)                                      Atoms
ML4    Tetrahedral             109.5
ML4    Square Planar           180.0                                            a12, a34
ML4    Square Planar           90.0                                             All others
ML5    Trigonal Bipyramidal    180.0                                            a12
ML5    Trigonal Bipyramidal    120.0                                            a34, a45, a35
ML5    Trigonal Bipyramidal    90.0                                             All others
ML5    Square Pyramid          $b_m = \frac{1}{4}\sum_{i=1}^{4} a_{i5}$         a15, a25, a35, a45
ML5    Square Pyramid          $360.0 - 2b_m$                                   a12, a34
ML5    Square Pyramid          $2\sin^{-1}\{2^{-1/2}[\sin(180.0 - b_m)]\}$      a13, a23, a14, a24
ML6    Octahedral              180.0                                            a12, a34, a56
ML6    Octahedral              90.0                                             All others

ligands in tetrahedral environments are shown in Fig. 5-5. Here the one-letter amino acid codes are used, with X describing an unknown ligand such as a non-standard amino acid or drug molecule. The most common tetrahedral Zn environment is CCCC, or four cysteines bound, followed by CCCH, CCHH, DHHH, HHHO, HHHX, CCHX, and CCHO as shown in Fig. 5-5. These data led the research to investigate the relationship between Zinc coordination environment and geometric parameters such as bond lengths and angles. The bonds between Zinc and Sulfur, Nitrogen and Oxygen from 10 unique primary shell environments are shown in Table 5-5.

The distributions of Zinc-Sulfur bond lengths in proteins that contain environments such as CCCC, CCCH, CCHH, and CHHH are shown in Fig. 5-6. A Box plot as shown in Fig. 5-7(a) may also be used to represent these data as it shows the maximum and minimum values, the lower and upper quartiles and the median. The Box plots of the variation of Zn-N bonds in the series CCCC, CCCH, CCHH, and CHHH are shown in Fig. 5-7(b).
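The quantities summarized by a Box plot, and tabulated per environment in Table 5-5, can be computed with a short routine such as the sketch below. The quartile convention used for the survey is not stated, so the "inclusive" interpolation method chosen here is an assumption, as are the function name and the illustrative input values, which are not survey data.

```python
import statistics


def box_summary(values):
    """Return the statistics reported for each bond type in Table 5-5."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "n": len(values),
        "min": min(values),
        "q1": q1,
        "median": median,
        "mean": statistics.mean(values),
        "q3": q3,
        "max": max(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }


# Illustrative bond lengths in Angstrom (made-up values, not taken from the PDB survey).
example = [2.29, 2.31, 2.33, 2.34, 2.36]
summary = box_summary(example)
```

A large gap between the mean and the median, as noted for the CHHH Zn-S bonds, shows up directly in such a summary.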

Figure 5-4. Zinc Coordination Geometry Distribution from the PDB. [Bar chart of Number (axis from 0 to 3500) against Coordination Type: oct, sqp, tbp, tet, tnb, trp, unk.]
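The percentages behind Fig. 5-4 follow directly from the counts quoted at the start of this section, as the short tally below shows. The counts and the total are taken from the text; attributing the remaining observations to the sqp bar, for which no count is quoted, is an assumption.

```python
# Counts quoted in the text for each coordination type.
counts = {"tet": 2964, "unk": 1372, "trp": 706, "tnb": 615, "tbp": 524, "oct": 228}
TOTAL = 6435  # total number of Zn observations quoted in the text

# Observations not covered by the quoted counts (assumed to be the sqp bar).
unlisted = TOTAL - sum(counts.values())

# Share of each coordination type in percent, rounded to one decimal place.
percent = {name: round(100.0 * n / TOTAL, 1) for name, n in counts.items()}
```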

The peaks of the Zn-S bond length distributions in the CCCC and CCCH environments lie between 2.3 and 2.4 Å, while for CCHH and CHHH systems they occur between 2.2 and 2.3 Å. There are only 14 instances of CHHH, and the Zn-S and Zn-N bonds have large standard deviations of 0.1364 and 0.1403 respectively. Also, in the case of the Zn-S bonds the median differs from the mean considerably (2.296 compared to 2.344), suggesting that these data are unreliable. The Box plots also point to some outliers in the data; for example, there are Zn-S bonds in the PDB below 1.5 Å which upon visual inspection seem crowded and poorly resolved. The distributions of Zinc-Oxygen and Zinc-Nitrogen bonds are shown in Fig. 5-8. Zinc bonded to Glutamic and Aspartic acid oxygens (GLU@OE1/OE2 or ASP@OD1/OD2) or histidine nitrogens (HIS@ND1/NE2) all show similar behavior, with bond lengths around 2.1 Å being most common. In spite of this similarity the standard deviations of ASP and

Table 5-5. Tetrahedral Zinc Primary Ligating Residues. Bond lengths are in Å. One-letter amino acid codes are used; C:CYS, H:HIS, O:HOH, D:ASP. N is the number of Bond instances. Min and Max are the minimum and maximum bond lengths respectively. The 1st Quartile, 3rd Quartile, mean, median, and standard deviation are statistical parameters to describe the bond length distribution.

Env.   N      Bond    Min     1st Quartile   Median   Mean    3rd Quartile   Max     Standard Deviation
CCCC   3284   Zn-S    1.424   2.294          2.338    2.338   2.389          2.805   0.1218
CCCH   1041   Zn-S    1.448   2.284          2.332    2.332   2.382          3.047   0.1089
CCHH   334    Zn-S    1.908   2.234          2.295    2.301   2.361          2.795   0.1289
CHHH   14     Zn-S    2.180   2.270          2.296    2.344   2.390          2.608   0.1364
CCCH   347    Zn-N    1.833   2.056          2.124    2.132   2.200          2.525   0.1157
CCHH   334    Zn-N    1.716   2.023          2.078    2.088   2.149          2.465   0.1188
CHHH   42     Zn-N    1.778   1.964          2.034    2.056   2.113          2.486   0.1403
HHHH   12     Zn-N    1.935   2.006          2.040    2.049   2.107          2.129   0.0627
HHHO   108    Zn-O    1.359   2.000          2.252    2.185   2.362          2.518   0.2218
HHOO   42     Zn-O    1.866   2.092          2.268    2.233   2.384          2.543   0.1816
HOOO   78     Zn-O    1.611   2.006          2.143    2.115   2.241          2.495   0.1914
OOOO   12     Zn-O    1.781   2.004          2.158    2.135   2.300          2.428   0.1917
HHHO   324    Zn-N    1.872   2.044          2.098    2.116   2.176          2.757   0.1140
HHOO   42     Zn-N    1.850   2.060          2.143    2.161   2.260          2.453   0.1455
HOOO   26     Zn-N    1.836   2.041          2.089    2.102   2.121          2.459   0.1295
HHHD   155    Zn-O    1.688   1.914          2.000    2.007   2.086          2.457   0.1425
HHDD   68     Zn-O    1.805   2.053          2.148    2.166   2.262          2.938   0.1840
HHHD   465    Zn-N    1.604   2.000          2.064    2.077   2.144          2.499   0.1275
HHDD   68     Zn-N    1.959   2.101          2.184    2.192   2.302          2.460   0.1280
ASP    460    Zn-O    1.688   1.958          2.044    2.077   2.165          2.988   0.1899
GLU    227    Zn-O    1.462   1.996          2.102    2.134   2.265          2.823   0.2276
HIS    825    Zn-ND   1.716   2.030          2.096    2.107   2.181          2.525   0.1256
HIS    1768   Zn-NE   1.604   2.031          2.093    2.108   2.177          2.757   0.1228

GLU bonds are greater than those of HIS. It is also worth noting that the majority of Zn histidine bonding is through the epsilon Nitrogen.

5.3.2 Tetrahedral Zn Environment Force Field Parameterization

Metal-ligand bonds are softer than those of organic molecules; however, there are obvious trends in the data presented above. These findings encouraged the formation of a generalized FF for Zn called the Zinc AMBER Force Field or ZAFF. A concept key to this work is “plug-and-play” where a researcher can download a metalloprotein structure

Figure 5-5. Distribution of the Most Common Tetrahedral Zinc Coordinating Ligand Combinations. Three lettered environments also contain a secondary ligand not shown. [Horizontal bar chart of Number (axis from 0 to 800) against Primary Shell Ligands, listed from top to bottom: HSXX, HOOO, HOO, HHXX, HHX, HHOX, HHOO, HHHX, HHHO, HHH, EHOO, EHH, EEHH, EEH, DOO, DHOO, DHO, DHHX, DHHO, DHHH, DHH, DDHH, DDEX, CHHX, CHHO, CHHH, CEHO, CCHX, CCHO, CCHH, CCH, CCEH, CCDH, CCCX, CCCO, CCCH, CCCC, CCC.]

from the PDB and run dynamics using predefined parameters as illustrated in Fig. 5-9. To this end FFs for the 10 unique environments shown in Table 5-5 were built using MCPB. A single structure from the PDB representative of each environment was chosen. FFs were built using the B3LYP DFT method [315–317] with the 6-31G* basis set [318]. The resulting FFs were stored in cluster xml files using the definitions of stdLibrary and stdGroup from Ch. 3.2.3 for later use.

The 1A5T structure from the PDB was used as a representative structure of the Zn-CCCC cluster. Two models of this cluster were built using MCPB as shown in Fig. 5-10. The calculated bond and angle force constants are shown in Tables 5-6 and 5-7. The average Zn-S bond length is 2.426 Å, which is higher than the mean value from the survey of the PDB but within one standard deviation. The corresponding mean bond force constant is 100.677 kcal/(mol·Å²). The mean S-Zn-S and C-S-Zn angles are 109.138° and 101.754° with average force constants of 15.016 and 81.505 kcal/(mol·rad²) respectively.

Table 5-6. Zn-CCCC Cluster Bond Lengths, r, in Å and Force Constants, Kr, in kcal/(mol·Å²) (PDB ID: 1A5T).

Bond     r         Kr
ZN-S1    2.42511   100.709
ZN-S2    2.42442   101.586
ZN-S3    2.42459   101.013
ZN-S4    2.43049   99.4008
Mean     2.4260    100.677
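The means in Table 5-6 can be reproduced, and put to use, with the sketch below. It assumes that the tabulated Kr enters an AMBER-style harmonic bond term E = Kr (r − req)² directly, with no factor of one half; the function and variable names are illustrative only. The energy is evaluated at the PDB survey mean of 2.338 Å for the CCCC environment (Table 5-5).

```python
# (r in Angstrom, Kr in kcal/(mol A^2)) for the four Zn-S bonds of Table 5-6.
table_5_6 = {
    "ZN-S1": (2.42511, 100.709),
    "ZN-S2": (2.42442, 101.586),
    "ZN-S3": (2.42459, 101.013),
    "ZN-S4": (2.43049, 99.4008),
}

mean_r = sum(r for r, _ in table_5_6.values()) / len(table_5_6)
mean_k = sum(k for _, k in table_5_6.values()) / len(table_5_6)


def harmonic_bond_energy(r, r_eq, k_r):
    """AMBER-style bond term (assumed form): E = Kr * (r - r_eq)**2, in kcal/mol."""
    return k_r * (r - r_eq) ** 2


# Energy cost of holding the bond at the PDB survey mean instead of the QM value.
penalty = harmonic_bond_energy(2.338, mean_r, mean_k)
```

Under these assumptions the penalty is below 1 kcal/mol, which illustrates how soft the metal-ligand bond is.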

The structures 1A73 and 2GIV from the PDB were used as characteristic structures of the Zn-CCCH cluster. The delta Nitrogen of Histidine is bonded to the Zinc atom in 1A73 while the epsilon Nitrogen is bound in 2GIV. Two models of each cluster were built using MCPB as shown in Fig. 5-11. The bond lengths and force constants are tabulated in Table 5-8, while the angles and angle force constants are shown in Tables 5-9 and 5-10. The average Zn-S bond lengths are 2.355 Å and 2.352 Å, which are in good agreement with the values determined from the PDB survey. The Zn-S bond lengths are shorter in

Table 5-7. Zn-CCCC Cluster Angles, θ, in Degrees and Force Constants, Kθ, in kcal/(mol·rad²) (PDB ID: 1A5T).

θ
Atom   S-Zn-S2    S-Zn-S3    S-Zn-S4    C-S-Zn
S1     107.384    111.062    108.75     101.721
S2     -          109.664    110.75     102.136
S3     -          -          109.22     101.952
S4     -          -          -          101.210
Mean   S-Zn-S 109.138; C-S-Zn 101.754

Kθ
Atom   S-Zn-S2    S-Zn-S3    S-Zn-S4    C-S-Zn
S1     13.9532    16.0208    13.0393    74.2777
S2     -          15.2898    16.7256    84.6981
S3     -          -          15.0688    85.4549
S4     -          -          -          81.5913
Mean   S-Zn-S 15.016; C-S-Zn 81.505

Zn-CCCH than they are in the Zn-CCCC cluster and this corresponds to the change in force constant from 100.677 to 144.003 and 142.575 kcal/(mol·Å²). The mean S-Zn-S, S-Zn-N, and C-S-Zn angles for the 1A73 and 2GIV clusters are 115.161°/116.381°, 102.938°/101.270° and 102.534°/101.793° with average force constants of 13.695/10.377, 21.909/15.213, and 78.568/65.170 kcal/(mol·rad²) respectively.

Table 5-8. Zn-CCCH Cluster Bond Lengths, r, in Å and Force Constants, Kr, in kcal/(mol·Å²) (PDB ID: 1A73 and 2GIV).

PDB ID   Bond         r         Kr
1A73     Zn-S1        2.38103   129.940
1A73     Zn-S2        2.33594   153.514
1A73     Zn-S3        2.34818   148.555
1A73     Zn-NB        2.17615   111.727
1A73     Zn-S Mean    2.355     144.003
2GIV     Zn-S1        2.35365   140.810
2GIV     Zn-S2        2.37024   135.986
2GIV     Zn-S3        2.33396   150.931
2GIV     Zn-NB        2.14457   109.821
2GIV     Zn-S Mean    2.352     142.575
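The bond force constants in Tables 5-6 and 5-8 are reported after the unit conversion described in Section 5.2.2. The sketch below shows where the factor of 2240.87 comes from; the value used for the Bohr radius in Å and the function names are assumptions of this example, not taken from MCPB.

```python
HARTREE_TO_KCAL_PER_MOL = 627.5095   # conversion factor quoted in Section 5.2.2
BOHR_TO_ANGSTROM = 0.52917721        # assumed value of the Bohr radius in Angstrom

# Hartree/Bohr^2 -> kcal/(mol A^2): divide the energy factor by the squared length factor.
BOND_FC_FACTOR = HARTREE_TO_KCAL_PER_MOL / BOHR_TO_ANGSTROM ** 2


def bond_fc_to_kcal(k_hartree_per_bohr2):
    """Convert a bond force constant from Hartree/Bohr^2 to kcal/(mol A^2)."""
    return k_hartree_per_bohr2 * BOND_FC_FACTOR


def angle_fc_to_kcal(k_hartree_per_rad2):
    """Convert an angle force constant from Hartree/rad^2 to kcal/(mol rad^2)."""
    return k_hartree_per_rad2 * HARTREE_TO_KCAL_PER_MOL
```

Only the energy unit changes for angle terms, since radians are used in both unit systems.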

The 1A1F structure from the PDB was used as a representative structure of the Zn-CCHH cluster. Two models of this cluster were built using MCPB as shown in Fig.

Table 5-9. Zn-CCCH Cluster Angles, θ, in Degrees and Force Constants, Kθ, in kcal/(mol·rad²) (PDB ID: 1A73).

θ
Atom   S-Zn-S2    S-Zn-S3    S-Zn-NB    C-S-Zn
S1     112.941    117.548    98.457     101.452
S2     -          114.996    110.215    103.370
S3     -          -          100.144    102.782
Mean   S-Zn-S 115.161; S-Zn-NB 102.938; C-S-Zn 102.534

Atom   CR-NB-Zn   CC-NB-Zn
NB     118.832    133.957

Kθ
Atom   S-Zn-S2    S-Zn-S3    S-Zn-NB    C-S-Zn
S1     14.8019    12.3254    31.4032    66.5568
S2     -          13.9588    15.4264    82.7660
S3     -          -          18.8974    86.3823
Mean   S-Zn-S 13.695; S-Zn-NB 21.909; C-S-Zn 78.568

Atom   CR-NB-Zn   CC-NB-Zn
NB     61.5645    62.4920

Table 5-10. Zn-CCCH Cluster Angles, θ, in Degrees and Force Constants, Kθ, in kcal/(mol·rad²) (PDB ID: 2GIV).

θ
Atom   S-Zn-S2    S-Zn-S3    S-Zn-NB    C-S-Zn
S1     122.16     114.051    99.8816    100.693
S2     -          112.932    94.1401    102.432
S3     -          -          109.790    102.254
Mean   S-Zn-S 116.381; S-Zn-NB 101.270; C-S-Zn 101.793

Atom   CR-NB-Zn   CV-NB-Zn
NB     124.745    128.494

Kθ
Atom   S-Zn-S2    S-Zn-S3    S-Zn-NB    C-S-Zn
S1     9.20218    11.0341    15.3383    59.2390
S2     -          10.8972    20.9385    72.5796
S3     -          -          9.36319    63.6916
Mean   S-Zn-S 10.377; S-Zn-NB 15.213; C-S-Zn 65.170

Atom   CR-NB-Zn   CV-NB-Zn
NB     33.4104    31.3493

5-12 with the calculated bond and angle force constants shown in Tables 5-11 and 5-12. The average Zn-S and Zn-N bond lengths are 2.305 Å and 2.088 Å with corresponding force constants of 181.478 and 147.126 kcal/(mol·Å²) respectively. The average value of the Zn-S bond length from the PDB is 2.301 Å while the mean Zn-N value is 2.088 Å, which are in excellent agreement with the calculated values. Both the Zn-S and Zn-N bonds are shorter than in the previous clusters and this is reflected in the stronger force constants determined. The mean S-Zn-N, C-S-Zn, and C-N-Zn angles for the 1A1F cluster are 103.232°, 105.054°, and 126.586° with average force constants of 12.488, 69.269, and 34.502 kcal/(mol·rad²) respectively.

Table 5-11. Zn-CCHH Cluster Bond Lengths, r, in Å and Force Constants, Kr, in kcal/(mol·Å²) (PDB ID: 1A1F).

Bond         r         Kr
Zn-S1        2.28799   191.997
Zn-S2        2.32226   170.959
Zn-NZ        2.09197   143.964
Zn-NY        2.08529   150.288
Zn-S Mean    2.305     181.478
Zn-N Mean    2.088     147.126
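The agreement between the calculated Zn-S bond lengths and the PDB survey, discussed above for the CCCC, CCCH and CCHH clusters, can be expressed in units of the survey standard deviation. The sketch below does this with the calculated means from Tables 5-6, 5-8 and 5-11 and the survey means and standard deviations from Table 5-5; the data layout and names are illustrative assumptions.

```python
# cluster label: (calculated mean r, survey mean r, survey standard deviation), in Angstrom
zn_s = {
    "CCCC (1A5T)": (2.426, 2.338, 0.1218),
    "CCCH (1A73)": (2.355, 2.332, 0.1089),
    "CCCH (2GIV)": (2.352, 2.332, 0.1089),
    "CCHH (1A1F)": (2.305, 2.301, 0.1289),
}


def z_score(calculated, survey_mean, survey_sd):
    """Deviation of the calculated value from the survey mean in standard deviations."""
    return (calculated - survey_mean) / survey_sd


z = {label: z_score(*values) for label, values in zn_s.items()}
largest = max(z, key=lambda label: abs(z[label]))
```

With these inputs every cluster lies within one survey standard deviation, the CCCC cluster showing the largest deviation.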

The final Zn center considered in this study which contains a cysteine residue was Zn-CHHH. The 1CK7 structure from the PDB was used to model the Zn-CHHH cluster. Two models of this cluster were built using MCPB as shown in Fig. 5-13 with the calculated bond and angle force constants shown in Tables 5-13 and 5-14. The Zn-S bond length was determined as 2.262 Å with a force constant of 186.196 kcal/(mol·Å²). The mean Zn-N bond length is 2.046 Å with a force constant of 180.437 kcal/(mol·Å²). The mean N-Zn-N and S-Zn-N angles are 105.835° and 112.950° with force constants of 2.795 and 12.488 kcal/(mol·rad²) respectively.

There are a very small number of Zinc atoms bound to four histidine residues in the PDB. However, to complete this computational study the bond and angle force constants were determined using 1PB0 as a starting geometry. The models created by MCPB are shown in Fig. 5-14 with the resulting bond lengths and angles and accompanying force constants

Table 5-12. Zn-CCHH Cluster Angles, θ, in Degrees and Force Constants, Kθ, in kcal/(mol·rad²) (PDB ID: 1A1F).

θ
Atom   S/N-Zn-S2   S/N-Zn-NY   S/N-Zn-NZ   C-S-Zn
S1     135.025     101.200     108.303     104.986
S2     -           102.721     100.705     105.122
NY     -           -           106.293     -

Atom   CR-N-Zn    CV-NB-Zn
NY     123.668    129.422
NZ     123.392    129.863

Kθ
Atom   S/N-Zn-S2   S/N-Zn-NY   S/N-Zn-NZ   C-S-Zn
S1     9.71416     10.1761     10.1241     63.3345
S2     -           14.5105     15.1437     75.2045
NY     -           -           8.41020     -

Atom   CR-N-Zn    CV-NB-Zn
NY     36.1351    32.4650
NZ     36.2138    33.1944

Mean      θ          Kθ
S-Zn-N    103.232    12.488
C-S-Zn    105.054    69.269
C-N-Zn    126.586    34.502

Table 5-13. Zn-CHHH Cluster Bond Lengths, r, in Å and Force Constants, Kr, in kcal/(mol·Å²) (PDB ID: 1CK7).

Bond         r         Kr
Zn-S1        2.26178   186.196
Zn-NX        2.05563   176.880
Zn-NY        2.04663   182.100
Zn-NZ        2.03803   182.333
Zn-N Mean    2.046     180.437

are shown in Tables 5-15 and 5-16. The average Zn-N bond distance is 2.010 Å with a force constant of 217.616 kcal/(mol·Å²). The mean N-Zn-N angle of the Zn center is 109.481° with a force constant of 6.088 kcal/(mol·rad²).

It is evident there are clear trends in the calculated bond lengths and force constants described above. The bond lengths of Zn-S through the series CCCC, CCCH, CCHH, and CHHH correlate with the calculated force constants with an R² value of 0.97 as seen in Fig. 5-15(a). The Zn-N bond lengths and force constants correlate with an R² of 0.95 as

Table 5-14. Zn-CHHH Cluster Angles, θ, in Degrees and Force Constants, Kθ, in kcal/(mol·rad²) (PDB ID: 1CK7).

θ
Atom   S/N-Zn-NX   S/N-Zn-NY   S/N-Zn-NZ   C-S-Zn
S1     108.625     110.015     120.210     104.518
NX     -           109.309     104.101     -
NY     -           -           104.096     -

Atom   CR-N-Zn    CV-NB-Zn
NX     118.357    135.099
NY     133.352    120.137
NZ     127.083    125.963

Kθ
Atom   S/N-Zn-NX   S/N-Zn-NY   S/N-Zn-NZ   C-S-Zn
S1     4.05583     4.03272     1.89062     10.5385
NX     -           2.38629     2.68101     -
NY     -           -           3.32002     -

Atom   CR-N-Zn    CV-NB-Zn
NX     23.1399    ND
NY     34.1629    37.8760
NZ     18.0772    9.73914

Mean      θ          Kθ
N-Zn-N    105.835    2.795
S-Zn-N    112.950    12.488

Table 5-15. Zn-HHHH Cluster Bond Lengths, r, in Å and Force Constants, Kr, in kcal/(mol·Å²) (PDB ID: 1PB0).

Bond         r         Kr
Zn-NW        2.00656   221.593
Zn-NX        2.01037   217.622
Zn-NY        2.01428   213.845
Zn-NZ        2.01130   217.407
Zn-N Mean    2.010     217.616

shown in Fig. 5-15(b). It is worth noting here that Zn donor bond lengths differ with environment. Thus having a single Zn-S or Zn-N equilibrium bond length and force constant value would not work. The proposed solution to this problem is to store all Zn bond types and assign the parameters in an automatic manner within the metal center perception algorithm of MTK++. The average S-Zn-S angle is smaller and its force constant stronger, 109.138° and 15.016 kcal/(mol·rad²), in the CCCC cluster compared to those of the

Table 5-16. Zn-HHHH Cluster Angles, θ, in Degrees and Force Constants, Kθ, in kcal/(mol·rad²) (PDB ID: 1PB0).

θ
Atom   N-Zn-NX    N-Zn-NY    N-Zn-NZ    CR-N-Zn    CV-N-Zn
NW     111.145    106.809    110.679    127.688    126.018
NX     -          111.299    106.671    127.492    126.229
NY     -          -          110.288    127.172    126.584
NZ     -          -          -          127.161    126.540

Kθ
Atom   N-Zn-NX    N-Zn-NY    N-Zn-NZ    CR-N-Zn    CV-N-Zn
NW     5.49705    6.93160    5.80074    33.8058    34.1690
NX     -          5.64881    7.14382    32.1774    32.2668
NY     -          -          5.50663    31.7145    32.0421
NZ     -          -          -          32.1549    32.3428

Mean       θ          Kθ
N-Zn-N     109.481    6.088
CR-N-Zn    127.378    32.463
CV-N-Zn    126.342    32.705

CCHH cluster, where values of 135.025° and 9.714 kcal/(mol·rad²) were determined. The N-Zn-N angles of the CCHH, CHHH, and HHHH clusters lie between 105.835° and 109.481° with force constants between 2.795 and 8.4102 kcal/(mol·rad²). The experimental force constant of the N-Zn-N angle was reported to be approximately 5.0 kcal/(mol·rad²), which is in good agreement with those calculated here. It has been reported that this angle force constant is too weak to prevent the angle opening beyond the ideal tetrahedral angle in MD simulations, and in the past arbitrary scaling factors have been applied to prevent this from occurring [252, 293]. A general scaling factor has not been developed here, as this study was designed to investigate the raw force constants produced by QM packages, but it may be further developed in the future. Thus the FFs shown in this chapter can be described as zeroth order, with further corrections necessary to carry out meaningful simulations.

The partial charges of the Zinc clusters CCCC, CCCH, CCHH, and CHHH were determined applying two different methods using the larger models described above. The first method allows the charges of all atoms of the bound residue to change (ChgModA) while the second technique restrains the backbone atoms (CA, H, HA, N, HN, C, O) to those values found in the AMBER parm94 force field (ChgModB). The charges were determined by first calculating the MK charges from Gaussian (a radius of 1.1 Å was used for Zinc) and then using the RESP program to zero out the charges on the capping groups. This procedure was carried out to disperse the charge over the entire cluster, thus removing the need to have a large +2 formal charge on Zinc. The ChgModA charges are shown in Table 5-17 and ChgModB charges are presented in Table 5-18. The SG atom in the CYM (unprotonated cysteine) residue in parm94 has a charge of −0.736, and the SG charge fluctuates between −0.485 and −0.640 in ChgModA and between −0.473 and −0.669 in ChgModB. One of the biggest differences between ChgModA and ChgModB is the charge on CB. CYM@CB has a charge of −0.736 while in ChgModA its charge lies closer to zero. ChgModB on the other hand places ≈ −0.4 on CB. It is unclear whether ChgModA is superior to ChgModB or vice versa, though it could be advantageous to keep the charges of the backbone atoms fixed to the parm94 values as these have been used in the fitting of the torsional parameters of the FF. It is also difficult to determine if this would matter, as the movement of the residues bound to Zinc would be restricted. The ChgModA and ChgModB charges for the Zinc-CCCH, -CCHH, -CHHH, and -HHHH clusters are outlined in Tables 5-19 and 5-20.

The variation of bond distances, angles, and partial charges of Zinc clusters containing histidine residues and water molecules was determined. The 1CA2 structure was used to represent the Zn-HHHO cluster which is a structure of human Carbonic Anhydrase II

Table 5-17. Cysteine Charges using ChgModA for the Zn-CCCC, -CCCH, -CCHH, and -CHHH Clusters. Partial Charges are in electron units.

Residue                N           CA          C          O
CYM                    -0.463000   0.035000    0.616000   -0.504000
CCCC                   -0.479963   0.003180    0.518101   -0.548044
CCCH (1A73) CY1/CY3    -0.474605   -0.108977   0.595840   -0.519264
CCCH (1A73) CY2        -0.414898   0.031295    0.278787   -0.450243
CCCH (2GIV)            -0.542307   0.017445    0.632188   -0.568329
CCHH CY1               -0.454687   0.005167    0.245131   -0.289983
CCHH CY2               -0.268043   -0.035885   0.462357   -0.532708
CHHH                   -0.447464   -0.025959   0.622632   -0.497571

Residue                CB          SG          HN         HA
CYM                    -0.736000   -0.736000   0.252000   0.048000
CCCC                   0.112192    -0.620103   0.281750   0.057052
CCCH (1A73) CY1/CY3    0.019831    -0.640071   0.287703   0.088639
CCCH (1A73) CY2        0.049561    -0.484948   0.266828   0.052328
CCCH (2GIV)            -0.048990   -0.591652   0.331010   0.049510
CCHH CY1               -0.097120   -0.581202   0.297192   0.126035
CCHH CY2               0.074672    -0.537449   0.295118   0.088752
CHHH                   -0.056717   -0.512655   0.277771   0.099225

Residue                HB2/3       ZN          Charge
CYM                    0.244000    -           -
CCCC                   0.002066    0.686817    -2.0
CCCH (1A73) CY1/CY3    0.065870    0.593065    -1.0
CCCH (1A73) CY2        0.056778    0.593065    -1.0
CCCH (2GIV)            0.084419    0.359392    -1.0
CCHH CY1               0.084759    0.263107    0.0
CCHH CY2               0.031256    0.263107    0.0
CHHH                   0.044767    0.552747    1.0

Table 5-18. Cysteine Charges using ChgModB for the Zn-CCCC, -CCCH, -CCHH, and -CHHH Clusters. Partial Charges are in electron units.

Residue                CB          SG          HB2/3       ZN
CYM                    -0.736000   -0.736000   0.244000    -
CCCC                   -0.462072   -0.530008   0.169606    0.675474
CCCH (1A73) CY1/CY3    -0.372264   -0.533198   0.171501    0.501937
CCCH (1A73) CY2        -0.324838   -0.472872   0.175531    0.501937
CCCH (2GIV)            -0.435825   -0.555154   0.212869    0.45405
CCHH CY1               0.003587    -0.668561   0.073015    0.362845
CCHH CY2               -0.221037   -0.531646   0.134076    0.362845
CHHH                   0.159424    -0.552344   -0.017513   0.505550

Table 5-19. Histidine Charges using ChgModA for the Zn-CCCH, -CCHH, -CHHH, and -HHHH Clusters. Partial Charges are in electron units.

Residue       N           CA          C          O
HIE           -0.415700   -0.058100   0.597300   -0.567900
HID           -0.415700   0.018800    0.597300   -0.567900
CCCH (HIE)    0.013815    -0.094517   0.609742   -0.511703
CCCH (HID)    -0.465489   0.012311    0.744779   -0.614433
CCHH (HID)    -0.411312   -0.035148   0.603874   -0.536756
CHHH (HID)    -0.385664   -0.089931   0.683060   -0.554579
HHHH (HID)    -0.321934   -0.129789   0.661692   -0.523604

Residue       CB          CG           ND1         CD2
HIE           -0.007400   0.1868000    -0.54320    -0.220700
HID           -0.046200   -0.0266000   -0.38110    0.129200
CCCH (HIE)    -0.009333   0.1388840    -0.200272   -0.231676
CCCH (HID)    -0.035338   -0.1183230   -0.143343   -0.003696
CCHH (HID)    -0.087860   -0.0472060   -0.121450   -0.070041
CHHH (HID)    -0.171947   0.0159560    -0.096528   0.003599
HHHH (HID)    -0.146428   -0.0046330   -0.130169   -0.067660

Residue       CE1         NE2         H           HA
HIE           0.163500    -0.279500   0.271900    0.136000
HID           0.205700    -0.572700   0.271900    0.088100
CCCH (HIE)    -0.002223   -0.152285   -0.043986   0.015705
CCCH (HID)    0.051160    -0.085718   0.269102    0.080535
CCHH (HID)    -0.080618   -0.041354   0.269483    0.108816
CHHH (HID)    -0.064531   -0.228608   0.285362    0.104929
HHHH (HID)    -0.059695   -0.145091   0.242073    0.135166

Residue       HB2        HB3        HE1        HE2
HIE           0.036700   0.036700   0.143500   0.333900
HID           0.040200   0.040200   0.139200   -
CCCH (HIE)    0.028328   0.028328   0.144631   0.307178
CCCH (HID)    0.029513   0.029513   0.121003   -
CCHH (HID)    0.072346   0.072346   0.175582   -
CHHH (HID)    0.107686   0.107686   0.186129   -
HHHH (HID)    0.120901   0.120901   0.185187   -

Residue       HD1        HD2        ZN         Charge
HIE           -          0.186200   -          0.0
HID           0.364900   0.114700   -          0.0
CCCH (HIE)    -          0.162384   0.593065   -1.0
CCCH (HID)    0.300711   0.125183   0.359392   -1.0
CCHH (HID)    0.321771   0.161285   0.263107   0.0
CHHH (HID)    0.297609   0.099257   0.552747   1.0
HHHH (HID)    0.322378   0.137633   0.412289   2.0

(HCA II) with the MCPB models shown in Fig. 5-16. As mentioned in Ch. 2.14, HCA II catalyzes the conversion of CO2 into bicarbonate. Therefore, to account for both the water-bound and hydroxide-bound states, two FFs were evaluated. Unsurprisingly, the bond lengths and associated force constants of the two systems differ. The Zn-O bond is longer when water is bound, while the Zn-N bonds are shorter; the stronger Zn-O bond in the hydroxide state is accompanied by longer Zn-N bonds (Table 5-21). The accompanying force constants are also considerably different: the Zn-O bond force constant changes from 120.287 to 394.674 kcal/(mol·Å²) upon removal of a proton, while the mean Zn-N force constant weakens from 248.420 to 194.357 kcal/(mol·Å²). These calculated equilibrium bond lengths and force constants are considerably different from those published by Hoops et al.; however, the QM methods used to generate the numbers also differ. The Zn-O bond lengths of the HHHO clusters in the PDB have a large standard deviation of 0.222 Å about a mean value of 2.185 Å, which is consistent with both states being present. The calculated angles and force constants for this cluster are shown in Table 5-22; they are in good agreement with those published previously, except for the H-O-Zn angle force constant, which was arbitrarily set to a higher value.

The 1VLI structure from the PDB was used to investigate the strength of the bond and angle force constants of the Zn-HHOO cluster. Again, the MCPB program was used to build the models (Fig. 5-17) required for parameterization, with the resulting equilibrium bond lengths and angles and the corresponding force constants shown in Tables 5-23 and 5-24. The equilibrium bond length of Zn-O was calculated as 1.946 Å, which is ≈0.4 Å shorter than the bond length for the Zn-HHHO cluster. This contradicts the trend from the PDB survey. Plausible reasons for this discrepancy include the small number of data points for the Zn-HHOO cluster in the PDB and the large standard deviation of 0.146 Å. The angle force constants calculated for this cluster are of similar magnitude to those calculated for the Zn-HHHO cluster.
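To make the difference between the two Zn-HHHO parameter sets concrete, the following minimal Python sketch evaluates the harmonic bond term E = Kr (r - req)² for the Zn-OW bond using the values quoted in Table 5-21. The function and dictionary names are illustrative assumptions and are not part of MCPB or MTK++.

```python
# Harmonic bond term in the AMBER convention: E = K_r * (r - r_eq)^2.
# The Zn-OW parameters below are the values quoted in Table 5-21;
# all names in this sketch are hypothetical.
ZN_OW = {
    "H2O": {"r_eq": 2.1122, "k_r": 120.287},  # water-bound state
    "HO-": {"r_eq": 1.8596, "k_r": 394.674},  # hydroxide-bound state
}


def bond_energy(r, r_eq, k_r):
    """Bond energy in kcal/mol for a bond length r given in Angstrom."""
    return k_r * (r - r_eq) ** 2


def stretch_penalties(displacement=0.1):
    """Energy cost of stretching each Zn-OW bond by `displacement` Angstrom."""
    return {
        state: bond_energy(p["r_eq"] + displacement, p["r_eq"], p["k_r"])
        for state, p in ZN_OW.items()
    }


if __name__ == "__main__":
    for state, energy in stretch_penalties().items():
        print(f"{state}: {energy:.3f} kcal/mol")
```

For the same displacement the hydroxide-bound bond is penalized roughly three times more strongly, which is simply the ratio of the two force constants.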

Table 5-20. Histidine Charges using ChgModB for the Zn-CCCC, -CCCH, -CCHH, and -CHHH Clusters. Partial Charges are in electron units.

Residue        CB          CG          ND1         CD2
HIE           -0.007400    0.186800   -0.54320    -0.220700
HID           -0.046200   -0.026600   -0.38110     0.129200
CCCH (HIE)    -0.390226    0.172204   -0.218625   -0.240248
CCCH (HID)     0.052173   -0.062260   -0.152337   -0.093846
CCHH (HID)     0.021453   -0.074575   -0.103342   -0.083769
CHHH (HID)     0.097861   -0.131328   -0.072847    0.010476
HHHH (HID)     0.299667   -0.158251   -0.10536    -0.114661

Residue        CE1         NE2         HB2/3       HD1
HIE            0.163500   -0.279500    0.036700
HID            0.205700   -0.572700    0.040200    0.364900
CCCH (HIE)     0.025646   -0.1386      0.176704
CCCH (HID)     0.044669   -0.107399    0.001993    0.302699
CCHH (HID)    -0.090768   -0.047611    0.034279    0.317081
CHHH (HID)    -0.066727   -0.200851    0.048588    0.295541
HHHH (HID)    -0.067193   -0.078907    0.001086    0.310647

Residue        HE1         HE2         HD2         ZN
HIE            0.143500    0.333900    0.186200
HID            0.139200                0.114700
CCCH (HIE)     0.125452    0.300742    0.164376    0.501937
CCCH (HID)     0.125431                0.184058    0.45405
CCHH (HID)     0.171776                0.165012    0.362845
CHHH (HID)     0.184941                0.106056    0.50555
HHHH (HID)     0.181314                0.15876     0.317246

Table 5-21. Zn-HHHO Cluster Bond Lengths, r, in Å and Force Constants, Kr, in kcal/(mol·Å²) (PDB ID: 1CA2).

State   Bond         r         Kr
H2O     Zn-NX        1.9783    250.691
        Zn-NY        1.9817    246.526
        Zn-NZ        1.9836    248.043
        Zn-OW        2.1122    120.287
        Zn-N Mean    1.981     248.420
HO-     Zn-NX        2.0293    190.815
        Zn-NY        2.0400    194.203
        Zn-NZ        2.0412    198.055
        Zn-OW        1.8596    394.674
        Zn-N Mean    2.036     194.357

170 Table 5-22. Zn-HHHO Cluster Angles, θ, in Degrees and Force Constants, Kθ, in kcal/(mol rad2) (PDB ID: 1CA2). · Angle Angle Angle Angle N-Zn-NY N-Zn-NZ N-Zn-OW CR-N-Zn

H2O θ NX 109.645 123.191 100.449 128.302 NY 113.479 109.534 127.604 NZ 98.1932 125.145 CV/CC-N-Zn H-OW-Zn NX 125.323 NY 126.100 NZ 127.730 HW 123.942 H2O Kθ NX 8.00872 8.22809 5.61984 34.2350 NY 7.62091 3.74798 34.5691 NZ 5.75939 40.5598 CV/CC-N-Zn H-OW-Zn NX 34.0616 NY 35.2802 NZ 44.0110 HW 20.5484 HO− θ NX 106.050 107.502 124.853 126.720 NY 116.017 102.656 114.165 NZ 100.380 113.620 CV/CC-N-Zn H-OW-Zn NX 126.592 NY 139.179 NZ 138.859 HW 116.648 − HO Kθ NX 9.25545 9.22715 7.39156 29.5398 NY 6.99278 10.4632 38.9903 NZ 12.0718 42.5650 CV/CC-N-Zn H-OW-Zn NX 30.0396 NY 33.8686 NZ 37.2270 HW 38.9579

[Figure 5-6: four histograms of Frequency versus Bond Lengths; panels (a) CCCC, (b) CCCH, (c) CCHH, (d) CHHH.]

Figure 5-6. Zn-S Bond Length Distributions in CCCC, CCCH, CCHH, and CHHH Tetrahedral Environments.

Table 5-23. Zn-HHOO Cluster Bond Lengths, r, in Å and Force Constants, Kr, in kcal/(mol·Å²) (PDB ID: 1VLI).

Bond         r         Kr
Zn-NX        1.9443    199.330
Zn-NY        1.9488    286.855
Zn-OX        2.0577    157.459
Zn-OY        2.0493    77.0940
Zn-N Mean    1.946     243.092
Zn-O Mean    2.053     117.276

[Figure 5-7: box plots of (a) Zn-S Bond Distance (Å) and (b) Zn-N Bond Distance (Å) for each cluster type.]

Figure 5-7. Box Plots of Zn-S/N Bond Lengths in CCCC, CCCH, CCHH, CHHH, and HHHH environments. (a) Zn-S Bond Lengths through the series CCCC, CCCH, CCHH, and CHHH. (b) Zn-N Bond Lengths through the series HHHH, HCCC, HHCC, and HHHC.

Table 5-24. Zn-HHOO Cluster Angles, θ, in Degrees and Force Constants, Kθ, in kcal/(mol·rad²) (PDB ID: 1VLI). ND denotes a value that was not reported.

θ        N-Zn-NY     N-Zn-OX     O/N-Zn-OY
NX       121.973     108.959     112.588
NY                   107.009     105.529
OX                               98.0344

θ        CV-N-Zn     CR-N-Zn     HW-O-Zn
NX       122.128     131.376
NY       126.588     126.944
OX                               124.788
OY                               122.326

Kθ       N-Zn-NY     N-Zn-OX     O/N-Zn-OY
NX       5.81366     5.71586     ND
NY                   4.36309     4.47123
OX                               3.69369

Kθ       CV-N-Zn     CR-N-Zn     HW-O-Zn
NX       10.6697     16.9386
NY       33.5296     33.5677
OX                               26.4380
OY                               ND

[Figure 5-8: four histograms of Frequency versus Bond Lengths; panels (a) Asp@OD1/OD2, (b) Glu@OE1/OE2, (c) His@ND1, (d) His@NE2.]

Figure 5-8. Tetrahedral Zn-O(Asp/Glu) and Zn-N(His) Bond Length Distributions.

[Flow diagram: PDB structure; decision "Contains Metal?" (Yes/No); decision "Cluster Stored?" (Yes/No); "Carry out steps in Fig. 5-2"; End.]

Figure 5-9. ZAFF Flow Diagram. When a metalloprotein structure is downloaded from the PDB and an equivalent metal site is already stored, the MTK++ package can assign the parameters needed to carry out MD simulations.

(a) 1A5T Model 1 (b) 1A5T Model 2

Figure 5-10. Zn-CCCC Cluster Models (PDB ID: 1A5T).

(a) 1A73 Model 1 (b) 1A73 Model 2

(c) 2GIV Model 1 (d) 2GIV Model 2

Figure 5-11. Zn-CCCH Cluster Models (PDB ID: 1A73 and 2GIV).

(a) 1A1F Model 1 (b) 1A1F Model 2

Figure 5-12. Zn-CCHH Cluster Models (PDB ID: 1A1F).

(a) 1CK7 Model 1 (b) 1CK7 Model 2

Figure 5-13. Zn-CHHH Cluster Models (PDB ID: 1CK7).

(a) 1PB0 Model 1 (b) 1PB0 Model 2

Figure 5-14. Zn-HHHH Cluster Models (PDB ID: 1PB0).

[Figure 5-15: two scatter plots. (a) Zn-S Force Constant (kcal/(mol·Å²)) versus Zn-S Calculated Bond Distance (Å). (b) Zn-N Force Constant (kcal/(mol·Å²)) versus Zn-N Calculated Bond Distance (Å).]

(a) The Correlation between Zn-Cys@S Bond Lengths and Calculated Force Constants through the Series CCCC, CCCH, CCHH, and CHHH.

(b) The Correlation between Zn-His@N Bond Lengths and Calculated Force Constants through the Series CCCH, CCHH, CHHH, and HHHH.

Figure 5-15. The Correlation between (a) Zn-Cys@S and (b) Zn-His@N Bond Lengths and Calculated Force Constants through the Series CCCC, CCCH, CCHH, CHHH, and HHHH.

(a) 1CA2 Model 1 (b) 1CA2 Model 2

Figure 5-16. Zn-HHHO Cluster Models (PDB ID: 1CA2).

(a) 1VLI Model 1 (b) 1VLI Model 2

Figure 5-17. Zn-HHOO Cluster Models (PDB ID: 1VLI).

The final tetrahedral environment containing histidine residues and water molecules was the HOOO cluster. The 1L3F PDB structure was used, with the MCPB models shown in Fig. 5-18. The average Zn-O and Zn-N bond lengths are 2.01 Å and 1.926 Å, respectively, which are shorter than those in the Zn-HHHO and Zn-HHOO clusters, in agreement with the experimental means from the PDB.

(a) 1L3F Model 1 (b) 1L3F Model 2

Figure 5-18. Zn-HOOO Cluster Models (PDB ID: 1L3F).

Table 5-25. Zn-HOOO Cluster Bond Lengths, r, in Å and Force Constants, Kr, in kcal/(mol·Å²) (PDB ID: 1L3F).

Bond         r         Kr
Zn-NX        1.9256    325.449
Zn-OW        2.0170    179.519
Zn-OX        2.0022    189.759
Zn-OY        2.0136    181.502
Zn-O Mean    2.0100    183.593

The partial charges of the histidine residues, water molecules, and zinc ion for the Zn-HHHO, -HHOO, and -HOOO clusters are outlined in Table 5-27.

Two clusters containing histidine and aspartate residues were considered in this study. The 2USN and 1U0A structures were chosen as characteristic of the Zn-HHHD and Zn-HHDD (Fig. 5-19) environments. The PDB survey showed that the length of the Zn-O bonds in H/D systems changed from 2.007 Å to 2.166 Å on going from HHHD to HHDD, and this trend is also seen in the calculated values for these clusters, as shown in Table 5-28.

Table 5-26. Zn-HOOO Cluster Angles, θ, in Degrees and Force Constants, Kθ, in kcal/(mol·rad²) (PDB ID: 1L3F).

θ        N/O-Zn-OW   N-Zn-OX     O/N-Zn-OY
NX       114.389     118.388     118.918
OW                   103.216     97.8629
OX                               100.927

θ        CR-N-Zn     CC-N-Zn     HW-O-Zn
NX       125.909     126.608
OW                               127.562
OX                               127.232
OY                               124.457

Kθ       N/O-Zn-OW   N-Zn-OX     O/N-Zn-OY
NX       3.83477     8.16145     4.17334
OW                   4.28235     3.60452
OX                               4.24390

Kθ       CR-N-Zn     CC-N-Zn     HW-O-Zn
NX       44.5591     51.1512
OW                               23.8887
OX                               24.9508
OY                               25.1823

5.4 Conclusions

This research describes the design, development, and implementation of two programs called pdbSearcher and MCPB. The former carries out metalloprotein data mining of the Protein Data Bank. Results focused on zinc metalloproteins because a large number of proteins contain this element. In the majority of Zn metalloproteins the metal is tetrahedrally coordinated by histidine, cysteine, aspartate, or glutamate residues, or by water molecules. The distribution of bond lengths between Zn and the donor atoms of these residues was investigated, and some short Zn-S bonds were highlighted, which may be due to errors during crystal structure refinement.

183 Table 5-27. Histidine and Water’s Partial Charges using ChgModB for the Zn-HHHO, -HHOO, and -HOOO Clusters. Partial Charges are in electron units. Residue CB CG ND1 CD2 HIE -0.007400 0.186800 -0.54320 -0.220700 HID -0.046200 -0.026600 -0.38110 0.129200 HHHO1 (HID1) 0.272769 -0.046469 -0.054651 -0.122505 HHHO1 (HID2) 0.579686 -0.290892 -0.115057 -0.073113 HHHO1 (HIE) 0.563536 -0.126874 -0.196737 -0.099394 HHHO2 (HID1) 0.07163 0.005255 -0.085911 -0.103044 HHHO2 (HID2) 0.379846 -0.232467 -0.086636 -0.024617 HHHO2 (HIE) 0.123745 -0.116227 0.001013 -0.073261 HHOO (HID) 0.564561 -0.334037 -0.048653 0.144851 HOOO (HIE) 0.817801 -0.151877 -0.360409 -0.256153 Residue CE1 NE2 HB2/3 HD1 HIE 0.163500 -0.279500 0.036700 HID 0.205700 -0.572700 0.040200 0.364900 HHHO1 (HID1) -0.153009 -0.160702 -0.015389 0.341095 HHHO1 (HID2) -0.057251 -0.113546 -0.078839 0.323076 HHHO1 (HIE) -0.093972 -0.092858 -0.130733 HHHO2 (HID1) -0.081199 -0.257224 0.038414 0.315667 HHHO2 (HID2) -0.005112 -0.210826 -0.036823 0.297259 HHHO2 (HIE) -0.071842 -0.204022 0.019225 HHOO (HID) 0.023257 -0.445271 -0.063809 0.315238 HOOO (HIE) -0.210361 0.125524 -0.140428 Residue HE1 HE2 HD2 ZN HIE 0.143500 0.333900 0.186200 HID 0.139200 0.114700 HHHO1(HID1) 0.187876 0.122148 0.702584 HHHO1(HID2) 0.210086 0.142360 HHHO1 (HIE) 0.169654 0.345567 0.167258 HHHO2(HID1) 0.186714 0.149779 0.674911 HHHO2(HID2) 0.167127 0.102959 HHHO2 (HIE) 0.176463 0.348448 0.153575 HHOO(HID) 0.176795 0.096861 1.02705 HOOO (HID) 0.191318 0.314799 0.264829 0.968391 Residue O H WAT -0.834 0.417 -OH HHHO (WAT) -0.765313 0.468035 HHHO (HO-) -1.003960 0.411829 HHOO (WAT) -0.742546 0.400146 HOOO (WAT) -0.595794 0.435269

Table 5-28. Zn-HHHD and Zn-HHDD Cluster Bond Lengths, r, in Å and Force Constants, Kr, in kcal/(mol·Å²) (PDB ID: 2USN and 1U0A).

PDB ID   Bond         r         Kr
2USN     Zn-NX        2.0729    176.828
         Zn-NY        2.0269    208.052
         Zn-NZ        2.0276    206.717
         Zn-O2        1.9865    282.503
         Zn-N Mean    2.0420    197.199
1U0A     Zn-NY        2.1247    133.766
         Zn-NZ        2.1404    128.104
         Zn-OA        2.1660    171.778
         Zn-OB        2.0517    209.806
         Zn-N Mean    2.1320    130.935
         Zn-O Mean    2.1080    190.792

Table 5-29. Zn-HHHD Cluster Angles, θ, in Degrees and Force Constants, Kθ, in kcal/(mol·rad²) (PDB ID: 2USN).

θ        N-Zn-NY     N-Zn-NZ     N-Zn-O      CR-N-Zn
NX       105.439     105.366     95.8076     115.956
NY                   120.039     113.247     123.274
NZ                               113.189     123.299

θ        CV/CC-N-Zn  C-O-Zn
NX       136.979
NY       130.128
NZ       130.170
O2                   100.499

Kθ       N-Zn-NY     N-Zn-NZ     N-Zn-O      CR-N-Zn
NX       17.8462     17.4805     11.7020     47.2274
NY                   7.20713     13.3154     50.0689
NZ                               13.3171     49.0374

Kθ       CV/CC-N-Zn  C-O-Zn
NX       41.7749
NY       45.0768
NZ       43.9566
O2                   197.677

(a) 2USN Model 1 (b) 2USN Model 2

(c) 1U0A Model 1 (d) 1U0A Model 2

Figure 5-19. Zn-HHHD and Zn-HHDD Cluster Models (PDB ID: 2USN and 1U0A).

The MCPB program was used to build, prototype, and validate AMBER-like force fields for metalloproteins, based on the bonded plus electrostatics model, that can be added to the AMBER suite of programs. MCPB was used to investigate environmental effects on bond lengths, angles, and bond and angle force constants using 10 unique metal coordinations: Zn bound to CCCC, CCCH, CCHH, CHHH, HHHH, HHHO, HHOO, HOOO, HHHD, and HHDD clusters. A Zinc AMBER Force Field (ZAFF) library was created to store these FF parameters conveniently, so that they can later be used with metalloproteins other than those used in the parameterization.

This work can have many uses in the future. Chiefly, the equilibrium bond lengths and angles can be used to aid the refinement of Zn metalloprotein X-ray crystal structures. The MCPB program also allows rapid development of FF parameters for metalloproteins, limited only by the cost of the ab initio or DFT calculations; such parameters could have many uses, for example in drug design projects where the target structure contains a metal ion. The program also provides a platform on which non-expert users can develop metalloprotein FF parameters, which was not available until now.

Table 5-30. Zn-HHDD Cluster Angles, θ, in Degrees and Force Constants, Kθ, in kcal/(mol·rad²) (PDB ID: 1U0A).

θ        N-Zn-NZ     N-Zn-OA     N-Zn-OB     CR-N-Zn
NY       96.9897     101.475     100.576     121.551
NZ                   87.8377     99.8952     117.541
OA                               155.527

θ        CC-N-Zn     O2-C-O      C-O-Zn
NY       131.075
NZ       132.416
OA                   120.909     88.3673
OB                   121.439     94.2323

Kθ       N-Zn-NZ     N-Zn-OA     N-Zn-OB     CR-N-Zn
NY       20.5679     9.54097     10.8215     54.9830
NZ                   16.5019     14.8041     47.8802
OA                               8.80214

Kθ       CC-N-Zn     O2-C-O      C-O-Zn
NY       44.1986
NZ       61.5788
OA                   148.620     187.032
OB                   183.487     382.989

Table 5-31. Histidine and Aspartate Residue Charges using ChgModB for the Zn-HHHD and -HHDD Clusters. Partial Charges are in electron units.

Residue        CB          CG          ND1         CD2
HIE           -0.007400    0.186800   -0.54320    -0.220700
HID           -0.046200   -0.026600   -0.38110     0.129200
HHHD (HID1)    0.042591   -0.024504   -0.072385   -0.137672
HHHD (HID2)    0.160366   -0.006538   -0.161147   -0.192906
HHHD (HIE)     0.374184   -0.005512   -0.148399   -0.245743
HHDD (HIE)     0.028284    0.076696   -0.306581   -0.169575

Residue        CE1         NE2         HB2/3       HD1
HIE            0.163500   -0.279500    0.036700
HID            0.205700   -0.572700    0.040200    0.364900
HHHD (HID1)   -0.147686   -0.122517    0.047948    0.313389
HHHD (HID2)   -0.066439   -0.063515    0.007072    0.343690
HHHD (HIE)    -0.066439   -0.038235   -0.074862
HHDD (HIE)     0.030139   -0.137031    0.039508

Residue        HE1         HE2         HD2         ZN
HIE            0.143500    0.333900    0.186200
HID            0.139200                0.114700
HHHD (HID1)    0.193188                0.193977    0.431980
HHHD (HID2)    0.179092                0.195708
HHHD (HIE)     0.179092    0.327123    0.195982
HHDD (HIE)     0.124130    0.309302    0.165779    0.919685

Residue        CB          CG          OD1/2       HB2/3
ASP           -0.030300    0.799400   -0.801400   -0.012200
HHHD           0.441335    0.373495   -0.493992   -0.112244
HHDD           0.202830    0.429585   -0.607229    0.050221

CHAPTER 6 CONCLUSIONS

This chapter provides a synopsis of the research presented in this dissertation, in which computational chemistry tools were successfully developed and applied in the areas of drug design and metalloprotein modeling.

Chapter two outlined the drug discovery process from the point of view of a computational chemist. The most common methods used to predict the binding free energy between receptors and their ligands were summarized, including ligand-based and receptor-based techniques. Statistical and graph theory methods were illustrated, and the use of these tools in chemistry was discussed. A general introduction to metalloproteins was also given.

The third chapter described the design and development of a computational chemistry package called MTK++. It was designed as a general-purpose molecular modeling suite for use in drug design and metalloprotein chemistry studies. This work was essential to the research in the rest of the thesis. MTK++ also provides a platform for further development of novel algorithms for modeling small molecules, proteins, and, most importantly, metalloproteins.

Chapter four detailed the development and large-scale validation of a semiflexible alignment approach using a semiempirical scoring function called CuTieP for in silico drug design. Results were comparable to those of empirical alignment approaches, with ligand binding geometries within protein active sites predicted with an accuracy of around 50%. CuTieP is a physics-based method that has the potential to be improved in terms of speed and accuracy while avoiding the pitfalls of parameter transferability.

The final research chapter described the design, development, and implementation of two programs called pdbSearcher and MCPB for the study of metalloproteins. These programs were used to data mine the Protein Data Bank and to develop molecular mechanics force fields for zinc metalloproteins. A total of 10 unique tetrahedral Zn force fields were built and the properties of these parameters were elucidated. These parameters may be used to address important structure, function, and dynamics questions that are not currently attainable using QM and QM/MM based methods.

APPENDIX A ALGORITHMS

A.1 Subgraph Isomorphism Algorithm

Algorithm A.1: Subgraph Isomorphism Algorithm
Data: molecule, fragment
Result: mapping if fragment found in molecule
begin
    int Pa; int Pb;
    array A[Pa, Pa];
    array B[Pb, Pb];
    array M[Pa, Pb];
    bool isomorphism = false;
    Ullmann(1, M);
end

Algorithm A.2: Ullmann Function
Data: current atom index, match matrix
Result: mapping if fragment found in molecule
begin
    array M1 = M;
    bool mismatch;
    for all unique mappings of atom d do
        choose new unique mapping for query atom d;
        update M accordingly;
        refine(M, mismatch);
        if !mismatch then
            if d == Pa then
                isomorphism = true;
                store M;
            else
                Ullmann(d+1, M);
            end
        else
            M = M1;
        end
    end
end

Algorithm A.3: Refine Function
Data: match matrix, boolean mismatch
Result: mapping if fragment found in molecule
begin
    mismatch = false;
    bool change = false;
    while not change or mismatch do
        for i → Pa do
            for j → Pb do
                if M[i][j] then
                    assign mij;
                    change = change or not mij;
                end
            end
        end
        assign mismatch;
    end
end
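A minimal Python sketch of the recursive search with refinement outlined in Algorithms A.1 to A.3 is given below. It is an illustration under stated assumptions, not the MTK++ implementation: graphs are passed as adjacency lists, the function name `ullmann` and the optional `compatible` callback are hypothetical, and the refinement step uses the common neighbour-consistency test (a candidate pair survives only if every neighbour of the query atom still has a candidate among the neighbours of the target atom).

```python
def ullmann(frag_adj, mol_adj, compatible=None):
    """Return every mapping (list: fragment atom -> molecule atom) that embeds
    the fragment graph in the molecule graph. Graphs are adjacency lists.
    `compatible(i, j)` is an optional, hypothetical atom-type filter."""
    pa, pb = len(frag_adj), len(mol_adj)
    # Match matrix: m[i][j] is True while query atom i may still map to atom j.
    m = [[(compatible is None or compatible(i, j))
          and len(frag_adj[i]) <= len(mol_adj[j])
          for j in range(pb)] for i in range(pa)]
    results = []

    def refine(mat):
        """Prune inconsistent candidates in place; return False on mismatch."""
        changed = True
        while changed:
            changed = False
            for i in range(pa):
                for j in range(pb):
                    if mat[i][j] and not all(
                            any(mat[x][y] for y in mol_adj[j])
                            for x in frag_adj[i]):
                        mat[i][j] = False
                        changed = True
                if not any(mat[i]):
                    return False  # query atom i has no candidate left
        return True

    def search(d, mat, used):
        for j in range(pb):
            if not mat[d][j] or j in used:
                continue
            trial = [row[:] for row in mat]            # copy, like M1 = M
            trial[d] = [k == j for k in range(pb)]      # fix atom d -> j
            if not refine(trial):
                continue                                # mismatch: next mapping
            if d == pa - 1:
                results.append([row.index(True) for row in trial])
            else:
                search(d + 1, trial, used | {j})

    if pa and refine(m):
        search(0, m, frozenset())
    return results
```

For example, `ullmann([[1], [0, 2], [1]], [[1], [0, 2], [1, 3], [2]])` searches for a three-atom chain inside a four-atom chain. Copying the match matrix before each trial assignment plays the role of the `M1 = M` backup in Algorithm A.2.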

A.2 Maximum Common Pharmacophore

Algorithm A.4: Find Maximum Common Pharmacophore (MCP)
Data: Two Molecules
Result: Maximum Common Pharmacophore Between the Two Molecules
begin
    Generate Feature Correspondence Matrix between the two molecules;
    Get Threshold Feature Score, TFS;
    bestClqScore = 0; bestClqSize = 0;
    for i → CorrespondenceMatrix do
        getPair ← 1; curClq ← i; curClqScore ← 0;
        while getPair do
            getPair ← 0; maxScore ← 0; pair ← 0;
            for j → CorrespondenceMatrix do
                testScore ← 0;
                for k → curClq do
                    jkScore = exp(−((dj − dk)/dm)²);
                    if jkScore > TFS then
                        testScore += jkScore
                    else
                        break;
                    end
                end
                if testScore > maxScore then
                    pair ← j; maxScore ← testScore;
                end
            end
            if pair then
                Add pair to curClq;
                curClqScore += maxScore;
                getPair ← 1;
            end
        end
        store ← 0;
        if curClqScore > bestClqScore + 0.1 then
            store = 1;
        end
        if curClqScore > bestClqScore − 0.1 then
            if curClqSize > bestSize then
                store = 1;
            end
            if curClqSize == bestSize then
                if curClqDist > bestDist + 0.1 then
                    store = 1;
                end
            end
        end
        if store then
            bestClqScore = curClqScore; bestClqSize = curClqSize;
            bestClqDist = curClqDist; bestClq ← curClq;
        end
    end
    return bestClq
end

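APPENDIX B AMBER GRADIENTS

The uncompressed AMBER energy function has the following form:

\[
\begin{aligned}
E_{total} ={}& \sum_{bonds} K_r (r_{ij} - r_{eq})^2 + \sum_{angles} K_\theta (\theta_{ijk} - \theta_{eq})^2 + \sum_{dihedrals} \sum_{n} V_n \left[1 + \cos(n\phi_{ijkl} - \gamma_n)\right] \\
&+ \sum_{i<j} \varepsilon_{ij} \left[ \left(\frac{r^*_{ij}}{r_{ij}}\right)^{12} - 2\left(\frac{r^*_{ij}}{r_{ij}}\right)^{6} \right] + \frac{1}{VDW} \sum_{i<j}^{1-4} \varepsilon_{ij} \left[ \left(\frac{r^*_{ij}}{r_{ij}}\right)^{12} - 2\left(\frac{r^*_{ij}}{r_{ij}}\right)^{6} \right] + \cdots
\end{aligned}
\tag{B–1}
\]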

Papers from Blondel and Karplus [319], Swope and Ferguson [320], and Tuzun, Noid and Sumpter [321], in addition to the molecular modeling book by Leach [81], were used as references in the derivation of the derivatives, or gradients, of the AMBER function.

B.1 Vector Math and Derivatives

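The vector $\mathbf{r}_{ij}$ between two atoms, $i$ and $j$, its length (the distance between the atoms), and the corresponding unit vector can be defined as:

\[
\begin{aligned}
\mathbf{r}_{ij} &= \mathbf{r}_i - \mathbf{r}_j && \text{(B–2)} \\
|\mathbf{r}_{ij}| &= \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2 + (z_i - z_j)^2} && \text{(B–3)} \\
&= \left((x_i - x_j)^2 + (y_i - y_j)^2 + (z_i - z_j)^2\right)^{1/2} && \text{(B–4)} \\
&= \sqrt{\mathbf{r}_{ij} \cdot \mathbf{r}_{ij}} && \text{(B–5)} \\
\hat{\mathbf{r}}_{ij} &= \frac{\mathbf{r}_{ij}}{|\mathbf{r}_{ij}|} && \text{(B–6)}
\end{aligned}
\]

where $\mathbf{r}_i$ is the position of atom $i$ and $x_i$ is the x component of the position vector of atom $i$.

In Cartesian space the differential operator is defined as:

\[
\nabla_i = \hat{x}\frac{\partial}{\partial x_i} + \hat{y}\frac{\partial}{\partial y_i} + \hat{z}\frac{\partial}{\partial z_i} \tag{B–7}
\]

where $\hat{x}$ is a unit vector parallel to the x axis of the reference coordinate system.

We want to calculate $\partial E/\partial x_i$, where $x_i$ is the x coordinate of atom $i$. It is best to first calculate $\partial E/\partial\lambda$, where $\lambda$ is the internal coordinate, and then apply the chain rule:

\[
\frac{\partial E}{\partial x_i} = \frac{\partial E}{\partial \lambda}\frac{\partial \lambda}{\partial x_i} \tag{B–8}
\]

The following identity is used for the angle terms:

\[
\begin{aligned}
\frac{d\theta}{d\cos\theta} &= \left(\frac{d\cos\theta}{d\theta}\right)^{-1} && \text{(B–9)} \\
&= -\frac{1}{\sin\theta} && \text{(B–10)}
\end{aligned}
\]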

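B.2 AMBER First Derivatives

B.2.1 Bond

\[
E_{bond} = K_r (r_{ij} - r_{eq})^2 \tag{B–11}
\]

where $K_r$ is the bond force constant, $r_{ij} = |\mathbf{r}_{ij}|$ is the bond length, and $r_{eq}$ is the standard bond length.

\[
\begin{aligned}
\nabla_i E &= \frac{\partial E}{\partial r} \cdot \nabla_i r && \text{(B–12)} \\
&= 2K_r (r_{ij} - r_{eq}) \cdot \frac{\mathbf{r}_{ij}}{|\mathbf{r}_{ij}|} && \text{(B–13)}
\end{aligned}
\]

\[
\begin{aligned}
\frac{\partial r}{\partial x_i} &= \frac{1}{2}\left((x_i - x_j)^2 + (y_i - y_j)^2 + (z_i - z_j)^2\right)^{-1/2} \cdot 2(x_i - x_j) && \text{(B–14)} \\
&= \frac{1}{r}(x_i - x_j) && \text{(B–15)} \\
\nabla_i r &= \frac{\mathbf{r}_{ij}}{|\mathbf{r}_{ij}|} && \text{(B–16)} \\
\nabla_j r &= -\nabla_i r && \text{(B–17)}
\end{aligned}
\]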
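As a quick sanity check on the bond gradient (Eq. B–13), the following Python sketch compares the analytic expression with a central finite difference. The function names and the step size are illustrative assumptions rather than part of any AMBER or MTK++ code.

```python
import math


def bond_energy(ri, rj, k_r, r_eq):
    """E_bond = K_r (r_ij - r_eq)^2, with r_ij the distance between atoms."""
    return k_r * (math.dist(ri, rj) - r_eq) ** 2


def bond_gradient_i(ri, rj, k_r, r_eq):
    """Analytic gradient with respect to atom i (Eq. B-13):
    2 K_r (r_ij - r_eq) * r_ij_vector / |r_ij|."""
    r = math.dist(ri, rj)
    scale = 2.0 * k_r * (r - r_eq) / r
    return [scale * (a - b) for a, b in zip(ri, rj)]


def bond_gradient_j(ri, rj, k_r, r_eq):
    """Gradient with respect to atom j is the negative of that for atom i."""
    return [-g for g in bond_gradient_i(ri, rj, k_r, r_eq)]


def numeric_gradient(f, x, h=1e-6):
    """Central finite-difference gradient of a scalar function f at point x."""
    grad = []
    for n in range(len(x)):
        xp, xm = list(x), list(x)
        xp[n] += h
        xm[n] -= h
        grad.append((f(xp) - f(xm)) / (2.0 * h))
    return grad
```

Agreement between `bond_gradient_i` and `numeric_gradient` for a few random geometries is a cheap way to catch sign or factor-of-two errors before the gradients are used in a minimizer.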

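B.2.2 Angle

\[
E_{angle} = K_\theta (\theta - \theta_{eq})^2 \tag{B–18}
\]

where $K_\theta$ is the angle force constant, $\theta$ is the angle ($0 \le \theta \le \pi$), and $\theta_{eq}$ is the standard angle.

\[
\begin{aligned}
\theta &= \arccos\left(\frac{\mathbf{r}_{ij} \cdot \mathbf{r}_{kj}}{|\mathbf{r}_{ij}||\mathbf{r}_{kj}|}\right) && \text{(B–19)} \\
\cos\theta &= \frac{\mathbf{r}_{ij} \cdot \mathbf{r}_{kj}}{|\mathbf{r}_{ij}||\mathbf{r}_{kj}|} && \text{(B–20)}
\end{aligned}
\]

\[
\begin{aligned}
\nabla_i E &= \frac{\partial E}{\partial \theta} \cdot \frac{\partial \theta}{\partial \cos\theta} \cdot \nabla_i \cos\theta && \text{(B–21)} \\
&= 2K_\theta (\theta - \theta_{eq}) \cdot \left(-\frac{1}{\sin\theta}\right) \cdot \nabla_i \cos\theta && \text{(B–22)}
\end{aligned}
\]

How is $\nabla_i \cos\theta$ determined? Because $\cos\theta$ is a function of $\mathbf{r}_{ij}$ and $\mathbf{r}_{kj}$, both of which are functions of $\mathbf{r}_i$, $\mathbf{r}_j$, and $\mathbf{r}_k$, the chain rule is needed:

\[
\begin{aligned}
\frac{\partial \cos\theta}{\partial x_i} &= \hat{x} \cdot \nabla_i \cos\theta && \text{(B–24)} \\
&= \frac{\partial \cos\theta}{\partial x_{ij}}\frac{\partial x_{ij}}{\partial x_i} + \frac{\partial \cos\theta}{\partial y_{ij}}\frac{\partial y_{ij}}{\partial x_i} + \frac{\partial \cos\theta}{\partial z_{ij}}\frac{\partial z_{ij}}{\partial x_i} \\
&\quad + \frac{\partial \cos\theta}{\partial x_{kj}}\frac{\partial x_{kj}}{\partial x_i} + \frac{\partial \cos\theta}{\partial y_{kj}}\frac{\partial y_{kj}}{\partial x_i} + \frac{\partial \cos\theta}{\partial z_{kj}}\frac{\partial z_{kj}}{\partial x_i} && \text{(B–25)} \\
&= \frac{\partial \cos\theta}{\partial x_{ij}} && \text{(B–26)}
\end{aligned}
\]

There is a total of 9 expressions similar to Eq. B–26, which lead to the following. Because $\mathbf{r}_{ij} = \mathbf{r}_i - \mathbf{r}_j$ and $\mathbf{r}_{kj} = \mathbf{r}_k - \mathbf{r}_j$, the derivatives with respect to the coordinates of atom $j$ carry a minus sign:

\[
\begin{aligned}
\nabla_i \cos\theta &= \hat{x}\frac{\partial \cos\theta}{\partial x_{ij}} + \hat{y}\frac{\partial \cos\theta}{\partial y_{ij}} + \hat{z}\frac{\partial \cos\theta}{\partial z_{ij}} && \text{(B–27)} \\
\nabla_j \cos\theta &= -\hat{x}\left(\frac{\partial \cos\theta}{\partial x_{ij}} + \frac{\partial \cos\theta}{\partial x_{kj}}\right) - \hat{y}\left(\frac{\partial \cos\theta}{\partial y_{ij}} + \frac{\partial \cos\theta}{\partial y_{kj}}\right) - \hat{z}\left(\frac{\partial \cos\theta}{\partial z_{ij}} + \frac{\partial \cos\theta}{\partial z_{kj}}\right) && \text{(B–28)} \\
\nabla_k \cos\theta &= \hat{x}\frac{\partial \cos\theta}{\partial x_{kj}} + \hat{y}\frac{\partial \cos\theta}{\partial y_{kj}} + \hat{z}\frac{\partial \cos\theta}{\partial z_{kj}} && \text{(B–29)}
\end{aligned}
\]

Finally, how is $\partial \cos\theta/\partial x_{ij}$ determined? Starting from

\[
\cos\theta = \frac{\mathbf{r}_{ij} \cdot \mathbf{r}_{kj}}{|\mathbf{r}_{ij}||\mathbf{r}_{kj}|} \tag{B–30}
\]

the quotient rule gives the derivative with respect to the vector $\mathbf{r}_{ij}$, whose x component is $\partial \cos\theta/\partial x_{ij}$:

\[
\begin{aligned}
\frac{\partial \cos\theta}{\partial \mathbf{r}_{ij}} &= \frac{|\mathbf{r}_{ij}||\mathbf{r}_{kj}|\,\dfrac{\partial (\mathbf{r}_{ij} \cdot \mathbf{r}_{kj})}{\partial \mathbf{r}_{ij}} - (\mathbf{r}_{ij} \cdot \mathbf{r}_{kj})\,\dfrac{\partial (|\mathbf{r}_{ij}||\mathbf{r}_{kj}|)}{\partial \mathbf{r}_{ij}}}{|\mathbf{r}_{ij}|^2 |\mathbf{r}_{kj}|^2} \\
&= \frac{|\mathbf{r}_{ij}||\mathbf{r}_{kj}|\,\mathbf{r}_{kj} - (\mathbf{r}_{ij} \cdot \mathbf{r}_{kj})\,|\mathbf{r}_{kj}|\,\dfrac{\mathbf{r}_{ij}}{|\mathbf{r}_{ij}|}}{|\mathbf{r}_{ij}|^2 |\mathbf{r}_{kj}|^2} \\
&= \frac{\mathbf{r}_{kj}}{|\mathbf{r}_{ij}||\mathbf{r}_{kj}|} - \frac{(\mathbf{r}_{ij} \cdot \mathbf{r}_{kj})\,\dfrac{\mathbf{r}_{ij}}{|\mathbf{r}_{ij}|}}{|\mathbf{r}_{ij}|^2 |\mathbf{r}_{kj}|} \\
&= \frac{1}{|\mathbf{r}_{ij}|}\left[\frac{\mathbf{r}_{kj}}{|\mathbf{r}_{kj}|} - \frac{(\mathbf{r}_{ij} \cdot \mathbf{r}_{kj})}{|\mathbf{r}_{ij}||\mathbf{r}_{kj}|}\frac{\mathbf{r}_{ij}}{|\mathbf{r}_{ij}|}\right] \\
&= \frac{1}{|\mathbf{r}_{ij}|}\left[\hat{\mathbf{r}}_{kj} - \cos\theta\,\hat{\mathbf{r}}_{ij}\right] && \text{(B–31)} \\
\frac{\partial \cos\theta}{\partial \mathbf{r}_{kj}} &= \frac{1}{|\mathbf{r}_{kj}|}\left[\hat{\mathbf{r}}_{ij} - \cos\theta\,\hat{\mathbf{r}}_{kj}\right] && \text{(B–32)}
\end{aligned}
\]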
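The angle gradient on atom i can be checked numerically in the same spirit. The sketch below combines Eqs. B–22 and B–31 and compares the result with a central finite difference; the helper names, the example geometry in the usage note, and the step size are assumptions made for illustration only.

```python
import math


def _sub(a, b):
    return [p - q for p, q in zip(a, b)]


def _dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def _norm(a):
    return math.sqrt(_dot(a, a))


def angle(ri, rj, rk):
    """Angle i-j-k in radians (Eq. B-19)."""
    rij, rkj = _sub(ri, rj), _sub(rk, rj)
    return math.acos(_dot(rij, rkj) / (_norm(rij) * _norm(rkj)))


def angle_energy(ri, rj, rk, k_theta, theta_eq):
    """E_angle = K_theta (theta - theta_eq)^2 (Eq. B-18)."""
    return k_theta * (angle(ri, rj, rk) - theta_eq) ** 2


def angle_gradient_i(ri, rj, rk, k_theta, theta_eq):
    """Analytic gradient on atom i. Undefined when sin(theta) = 0."""
    rij, rkj = _sub(ri, rj), _sub(rk, rj)
    nij, nkj = _norm(rij), _norm(rkj)
    cos_t = _dot(rij, rkj) / (nij * nkj)
    theta = math.acos(cos_t)
    # Eq. B-31: d(cos theta)/d r_ij = (r_kj_hat - cos(theta) r_ij_hat) / |r_ij|
    dcos = [(b / nkj - cos_t * a / nij) / nij for a, b in zip(rij, rkj)]
    # Eq. B-22: prefactor 2 K_theta (theta - theta_eq) * (-1 / sin theta)
    pref = 2.0 * k_theta * (theta - theta_eq) * (-1.0 / math.sin(theta))
    return [pref * c for c in dcos]


def numeric_gradient(f, x, h=1e-6):
    """Central finite-difference gradient of a scalar function f at point x."""
    grad = []
    for n in range(len(x)):
        xp, xm = list(x), list(x)
        xp[n] += h
        xm[n] -= h
        grad.append((f(xp) - f(xm)) / (2.0 * h))
    return grad
```

A bent, water-like geometry such as `ri = [0.96, 0, 0]`, `rj = [0, 0, 0]`, `rk = [-0.24, 0.93, 0.1]` keeps sin θ well away from zero, which is where the −1/sin θ factor in Eq. B–22 would otherwise blow up.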

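B.2.3 Dihedral

\[
E_{dihedral} = K_\phi \left[1 + \cos(n\phi - \gamma)\right] \tag{B–33}
\]

where $K_\phi$ is the torsional constant, $\phi$ is the dihedral angle ($-\pi \le \phi \le \pi$), and $\gamma$ is the standard dihedral angle.

\[
\cos\phi = \frac{\mathbf{t} \cdot \mathbf{u}}{|\mathbf{t}||\mathbf{u}|} \tag{B–34}
\]

where:

\[
\begin{aligned}
\mathbf{r}_{ij} &= \mathbf{r}_i - \mathbf{r}_j \\
\mathbf{r}_{kj} &= \mathbf{r}_k - \mathbf{r}_j \\
\mathbf{r}_{lk} &= \mathbf{r}_l - \mathbf{r}_k \\
\mathbf{t} &= \mathbf{r}_{ij} \times \mathbf{r}_{kj} && \text{(B–35)} \\
\mathbf{u} &= \mathbf{r}_{lk} \times \mathbf{r}_{kj} && \text{(B–36)}
\end{aligned}
\]

\[
\begin{aligned}
\nabla_i E &= \frac{\partial E}{\partial \phi} \cdot \frac{\partial \phi}{\partial \cos\phi} \cdot \nabla_i \cos\phi && \text{(B–37)} \\
&= -nK_\phi \sin(n\phi - \gamma) \cdot \left(-\frac{1}{\sin\phi}\right) \cdot \nabla_i \cos\phi && \text{(B–38)}
\end{aligned}
\]

How is $\nabla_i \cos\phi$ determined? Because $\cos\phi_{ijkl}$ is a function of $\mathbf{t}$ and $\mathbf{u}$, both of which are functions of $\mathbf{r}_{ij}$, $\mathbf{r}_{kj}$, and $\mathbf{r}_{lk}$, which are in turn functions of $\mathbf{r}_i$, $\mathbf{r}_j$, $\mathbf{r}_k$, and $\mathbf{r}_l$, the chain rule is needed:

\[
\begin{aligned}
\frac{\partial \cos\phi}{\partial x_i} &= \hat{x} \cdot \nabla_i \cos\phi && \text{(B–39)} \\
&= \frac{\partial \cos\phi}{\partial t_x}\frac{\partial t_x}{\partial x_i} + \frac{\partial \cos\phi}{\partial t_y}\frac{\partial t_y}{\partial x_i} + \frac{\partial \cos\phi}{\partial t_z}\frac{\partial t_z}{\partial x_i} \\
&\quad + \frac{\partial \cos\phi}{\partial u_x}\frac{\partial u_x}{\partial x_i} + \frac{\partial \cos\phi}{\partial u_y}\frac{\partial u_y}{\partial x_i} + \frac{\partial \cos\phi}{\partial u_z}\frac{\partial u_z}{\partial x_i} && \text{(B–40)} \\
&= \frac{\partial \cos\phi}{\partial t_y}(-z_{kj}) + \frac{\partial \cos\phi}{\partial t_z}(y_{kj}) && \text{(B–41)}
\end{aligned}
\]

where:

\[
\begin{aligned}
\mathbf{t} &= (\mathbf{r}_i - \mathbf{r}_j) \times (\mathbf{r}_k - \mathbf{r}_j) && \text{(B–42)} \\
&= \left((y_{ij} z_{kj} - y_{kj} z_{ij}),\ (z_{ij} x_{kj} - z_{kj} x_{ij}),\ (x_{ij} y_{kj} - x_{kj} y_{ij})\right) && \text{(B–43)} \\
t_x &= y_{ij} z_{kj} - y_{kj} z_{ij} && \text{(B–44)} \\
\frac{\partial t_x}{\partial x_i} &= \frac{\partial (y_{ij} z_{kj} - y_{kj} z_{ij})}{\partial x_i} = 0 && \text{(B–45)} \\
\frac{\partial t_y}{\partial x_i} &= -z_{kj} && \text{(B–46)}
\end{aligned}
\]

There is a total of 72 expressions similar to Eq. B–46.

What is $\partial \cos\phi/\partial t_x$? By analogy with the result from the angle derivation:

\[
\frac{\partial \cos\phi}{\partial t_x} = \frac{1}{|\mathbf{t}|}\left[\frac{u_x}{|\mathbf{u}|} - \cos\phi\,\frac{t_x}{|\mathbf{t}|}\right] \tag{B–47}
\]

However, there is a problem with this result due to the $1/\sin\phi$ in Eq. B–38: there are singularities when $\phi = 0, \pi$. Therefore, $\partial \cos\phi/\partial t_x$ needs to be rewritten.

\[
\begin{aligned}
\nabla_t \cos\phi &= \nabla_t \frac{\mathbf{t} \cdot \mathbf{u}}{|\mathbf{t}||\mathbf{u}|} && \text{(B–48)} \\
&= \frac{1}{|\mathbf{t}||\mathbf{u}|}\nabla_t (\mathbf{t} \cdot \mathbf{u}) + \frac{\mathbf{t} \cdot \mathbf{u}}{|\mathbf{u}|}\nabla_t \frac{1}{|\mathbf{t}|} && \text{(B–49)} \\
&= \frac{\mathbf{u}}{|\mathbf{t}||\mathbf{u}|} - \frac{\mathbf{t} \cdot \mathbf{u}}{|\mathbf{t}|^2 |\mathbf{u}|}\hat{\mathbf{t}} && \text{(B–50)} \\
&= \frac{1}{|\mathbf{t}|}\left[\hat{\mathbf{u}} - (\hat{\mathbf{t}} \cdot \hat{\mathbf{u}})\,\hat{\mathbf{t}}\right] && \text{(B–51)} \\
&= \frac{1}{|\mathbf{t}|}\left[(\hat{\mathbf{t}} \cdot \hat{\mathbf{t}})\,\hat{\mathbf{u}} - (\hat{\mathbf{t}} \cdot \hat{\mathbf{u}})\,\hat{\mathbf{t}}\right] && \text{(B–52)} \\
&= \frac{1}{|\mathbf{t}|}\left[\hat{\mathbf{t}} \times (\hat{\mathbf{u}} \times \hat{\mathbf{t}})\right] && \text{(B–53)} \\
&= \frac{1}{|\mathbf{t}|}\left[\hat{\mathbf{t}} \times (-\sin\phi\,\hat{\mathbf{r}}_{kj})\right] && \text{(B–54)}
\end{aligned}
\]

Multiplying by $\partial\phi/\partial\cos\phi = -1/\sin\phi$ cancels the $\sin\phi$ factor, so the gradient of $\phi$ itself is free of the singularity:

\[
\nabla_t \phi = \frac{1}{|\mathbf{t}|}\left[\hat{\mathbf{t}} \times \hat{\mathbf{r}}_{kj}\right] \tag{B–55}
\]

Similarly, for $\mathbf{u}$:

\[
\begin{aligned}
\nabla_u \cos\phi &= \frac{1}{|\mathbf{u}|}\left[\hat{\mathbf{u}} \times (\sin\phi\,\hat{\mathbf{r}}_{kj})\right] && \text{(B–56)} \\
\nabla_u \phi &= -\frac{1}{|\mathbf{u}|}\left[\hat{\mathbf{u}} \times \hat{\mathbf{r}}_{kj}\right] && \text{(B–57)}
\end{aligned}
\]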

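Finally,

\[
\begin{aligned}
\nabla_i \phi ={}& \hat{x}\left[\frac{\partial \phi}{\partial t_y}(-z_{kj}) + \frac{\partial \phi}{\partial t_z}(y_{kj})\right] + \hat{y}\left[\frac{\partial \phi}{\partial t_x}(z_{kj}) + \frac{\partial \phi}{\partial t_z}(-x_{kj})\right] \\
&+ \hat{z}\left[\frac{\partial \phi}{\partial t_x}(-y_{kj}) + \frac{\partial \phi}{\partial t_y}(x_{kj})\right] && \text{(B–58)} \\
\nabla_j \phi ={}& \hat{x}\left[\frac{\partial \phi}{\partial t_y}(z_{kj} - z_{ij}) + \frac{\partial \phi}{\partial t_z}(y_{ij} - y_{kj}) + \frac{\partial \phi}{\partial u_y}(-z_{lk}) + \frac{\partial \phi}{\partial u_z}(y_{lk})\right] \\
&+ \hat{y}\left[\frac{\partial \phi}{\partial t_x}(z_{ij} - z_{kj}) + \frac{\partial \phi}{\partial t_z}(x_{kj} - x_{ij}) + \frac{\partial \phi}{\partial u_x}(z_{lk}) + \frac{\partial \phi}{\partial u_z}(-x_{lk})\right] \\
&+ \hat{z}\left[\frac{\partial \phi}{\partial t_x}(y_{kj} - y_{ij}) + \frac{\partial \phi}{\partial t_y}(x_{ij} - x_{kj}) + \frac{\partial \phi}{\partial u_x}(-y_{lk}) + \frac{\partial \phi}{\partial u_y}(x_{lk})\right] && \text{(B–59)} \\
\nabla_k \phi ={}& \hat{x}\left[\frac{\partial \phi}{\partial t_y}(z_{ij}) + \frac{\partial \phi}{\partial t_z}(-y_{ij}) + \frac{\partial \phi}{\partial u_y}(z_{lk} + z_{kj}) + \frac{\partial \phi}{\partial u_z}(-y_{kj} - y_{lk})\right] \\
&+ \hat{y}\left[\frac{\partial \phi}{\partial t_x}(-z_{ij}) + \frac{\partial \phi}{\partial t_z}(x_{ij}) + \frac{\partial \phi}{\partial u_x}(-z_{kj} - z_{lk}) + \frac{\partial \phi}{\partial u_z}(x_{lk} + x_{kj})\right] \\
&+ \hat{z}\left[\frac{\partial \phi}{\partial t_x}(y_{ij}) + \frac{\partial \phi}{\partial t_y}(-x_{ij}) + \frac{\partial \phi}{\partial u_x}(y_{lk} + y_{kj}) + \frac{\partial \phi}{\partial u_y}(-x_{kj} - x_{lk})\right] && \text{(B–60)} \\
\nabla_l \phi ={}& \hat{x}\left[\frac{\partial \phi}{\partial u_y}(-z_{kj}) + \frac{\partial \phi}{\partial u_z}(y_{kj})\right] + \hat{y}\left[\frac{\partial \phi}{\partial u_x}(z_{kj}) + \frac{\partial \phi}{\partial u_z}(-x_{kj})\right] \\
&+ \hat{z}\left[\frac{\partial \phi}{\partial u_x}(-y_{kj}) + \frac{\partial \phi}{\partial u_y}(x_{kj})\right] && \text{(B–61)}
\end{aligned}
\]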

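B.2.4 Electrostatic

\[
E_{ele} = \frac{q_i q_j}{\varepsilon r_{ij}} \tag{B–62}
\]

\[
\frac{\partial E}{\partial r_{ij}} = -\frac{q_i q_j}{\varepsilon r_{ij}^2} \tag{B–63}
\]

B.2.5 van der Waals

\[
\begin{aligned}
E_{vdw} &= \varepsilon_{ij}\left[\left(\frac{r^*_{ij}}{r_{ij}}\right)^{12} - 2\left(\frac{r^*_{ij}}{r_{ij}}\right)^{6}\right] && \text{(B–64)} \\
&= \varepsilon_{ij}\left[r^{*12}_{ij}\frac{1}{r^{12}_{ij}} - 2r^{*6}_{ij}\frac{1}{r^{6}_{ij}}\right] && \text{(B–65)}
\end{aligned}
\]

\[
\begin{aligned}
\frac{\partial E}{\partial r_{ij}} &= \varepsilon_{ij}\left[-12\frac{r^{*12}_{ij}}{r^{13}_{ij}} + 12\frac{r^{*6}_{ij}}{r^{7}_{ij}}\right] && \text{(B–66)} \\
&= \varepsilon_{ij}\left[-\frac{12}{r_{ij}}\frac{r^{*12}_{ij}}{r^{12}_{ij}} + \frac{12}{r_{ij}}\frac{r^{*6}_{ij}}{r^{6}_{ij}}\right] && \text{(B–67)} \\
&= -\frac{12\varepsilon_{ij}}{r_{ij}}\left[\frac{r^{*12}_{ij}}{r^{12}_{ij}} - \frac{r^{*6}_{ij}}{r^{6}_{ij}}\right] && \text{(B–68)} \\
&= -12\varepsilon_{ij}\left(D^2 - D\right)/r_{ij} && \text{(B–69)} \\
D &= \frac{r^{*6}_{ij}}{r^{6}_{ij}} && \text{(B–70)}
\end{aligned}
\]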
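The compact form of the van der Waals derivative (Eqs. B–69 and B–70) can be sketched and checked in a few lines of Python. The function names are illustrative assumptions; `eps` and `r_star` stand for the well depth and the distance at the energy minimum.

```python
def vdw_energy(r, eps, r_star):
    """E_vdw = eps * [(r*/r)^12 - 2 (r*/r)^6] (Eq. B-64), written via D."""
    d = (r_star / r) ** 6          # D of Eq. B-70
    return eps * (d * d - 2.0 * d)


def vdw_derivative(r, eps, r_star):
    """dE/dr = -12 eps (D^2 - D) / r with D = (r*/r)^6 (Eq. B-69)."""
    d = (r_star / r) ** 6
    return -12.0 * eps * (d * d - d) / r


def numeric_derivative(f, x, h=1e-6):
    """Central finite-difference derivative of a scalar function f at x."""
    return (f(x + h) - f(x - h)) / (2.0 * h)
```

At r equal to r_star the derivative vanishes and the energy equals the negative of the well depth, which matches the interpretation of r* as the position of the minimum; reusing D for both the energy and its derivative avoids computing the twelfth power separately.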

APPENDIX C FRAGMENT LIBRARY

C.1 Terminal Fragments

Table C-1. Terminal Fragments.

Terminal Fragments have one connection point Structure Name 3L 8L Chg 1 R Methyl CH3 TF000CH3 0 2 R Ethyl ETH TF000ETH 0 3 R Propyl PYL TF000PYL 0

R 4 Isopropyl IPL TF000IPL 0 5 R n-butyl NBL TF000NBL 0

6 R sec-butyl SBL TF000SBL 0 R 7 Isobutyl IBL TF000IBL 0

R 8 tert-butyl TBL TF000TBL 0 H H

9 R H Vinyl AK1 TF000AK1 0 10 R H Alkynl TLK TF000TLK 0 1 R NH2 Primary amine PAM TF000PAM 0 + 2 R NH3 NH3 NH3 TF000NH3 +1 H N 3 R NHMe NHM TF000NHM 0 H N 4 R NHEt NHE TF000NHE 0 H N 5 R NHPr NHP TF000NHP 0 H N R 6 NHi-Pr NHI TF000NHI 0

RN 7 NMe2 NM2 TF000NM2 0 H N 8 R H Primary aldimine PAD TF000PAD 0 9 R C N Nitrile NIT TF000NIT 0 Continued on next page.

203 Table C-1. (continued)

Structure Name 3L 8L Chg

10 RN C Isonitrile INI TF000INT 0 11 RN N N Azide AZD TF000AZD 0 12 RN N Diazonium DZM TF000DZM +1 H H N N H H 13 R Amidine AMD TF000AMD +1 1 R OH Primary alcohol POL TF000POL 0 R O 2 H Alcohol ARA TF000ARA 0 O 3 R Methyl ether OME TF000OME 0 O 4 R Ethyl ether OET TF000OET 0 O 5 R Propyl ether OPR TF000OPR 0

O 6 R Isopropyl ether OIP TF000OIP 0 O

7 R1 H Aldehyde ALD TF000ALD 0 O 8 R OH Carboxylic acid CAA TF000CAA 0 O 9 R O Carboxylate CAR TF000CAR -1 O R OH 10 OH Alphahydroxy acid AHA TF000AHA 0 O 11 R1 OH Hydroperoxide HPO TF000HPO 0 O 12 R F Acyl fluoride ACF TF000ACF 0 O 13 R Cl Acyl Chloride ACC TF000ACC 0 O 14 R Br Acyl Bromide ACB TF000ACB 0 O 15 R I Acyl Iodide ACI TF000ACI 0 1 R SH Thiol RSH TF000RSH 0 Continued on next page.

204 Table C-1. (continued)

Structure Name 3L 8L Chg S 2 R Methyl thioether SME TF000SET 0 S 3 R Ethyl thioether SET TF000SET 0 S 4 R Propyl thioether SPR TF000SPR 0

S 5 R Isopropyl thioether SIP TF000SIP 0 S

6 R1 H Thioaldehyde TAD TF000TAD 0 S 7 R F Sulfenic acid fluoride SAF TF000SAF 0 S 8 R Cl Sulfenic acid chloride SAC TF000SAC 0 S 9 R Br Sulfenic acid bromide SAB TF000SAB 0 S 10 R I Sulfenic acid iodide SAI TF000SAI 0 1 RF Fluoride FL1 TF000FL1 0 2 R Cl Chloride CL1 TF000CL1 0 3 R Br Bromide BR1 TF000BR1 0 4 RI Iodide IO1 TF000IO1 0 5 R CF3 Trifluromethyl 3FM TF0003FM 0 6 R CCl3 Trichloromethyl 3CM TF0003CM 0 O

1 R NH2 Primary amide PMD TF000PMD 0 O 2 R N N N Carboxylic acid azide CAZ TF000CAZ 0 O OH R N 3 H Hydroxamic acid HOA TF000HOA 0 O

R C 4 N Acyl cyanide ACY TF000ACY 0 5 R OC N Cyanate CYN TF000CYN 0 6 R SC N Thiocyanate TCY TF000TCY 0 7 RN C O Isocyanate ICY TF000ICY 0 8 RN C S Isothiocyanate ITC TF000ITC 0 9 RN O Nitroso NRS TF000NRS 0 Continued on next page.

205 Table C-1. (continued)

Structure Name 3L 8L Chg

O RN 10 O Nitro NRO TF000NRO 0 11 R ON O Nitrite NRT TF000NRT 0 O R ON 12 O Nitrate NRA TF000NRA 0 O R S OH 13 O Sulfonic acid SNA TF000SNA 0 O S 15 R1 OH Sulfinic acid SIA TF000SIA 0 S 16 R OH Sulfenic acid SEA TF000SEA 0 OH B 17 R OH Boronic acid BOA TF000BOA 0 S 18 R O Thio-carboxylate TCA TF000TCA -1 S 19 R S Dithio-carboxylate DTC TF000DTC -1 S 20 R OH Thio-carboxylic acid TCO TF000TCO 0 S 21 R SH Dithio-carboxylic acid TCS TF000TCS 0 O R S F 22 O Sulfonic acid fluoride SOF TF000SOF 0 O R S Cl 23 O Sulfonic acid chloride SOC TF000SOC 0 O R S Br 24 O Sulfonic acid bromide SOB TF000SOB 0 O R S I 25 O Sulfonic acid iodide SOI TF000SOI 0 O S 26 R F Sulfinic acid fluoride SIF TF000SIF 0 O S 27 R Cl Sulfinic acid chloride SIC TF000SIC 0 Continued on next page.

206 Table C-1. (continued)

Structure Name 3L 8L Chg O S 28 R Br Sulfinic acid bromide SIB TF000SIB 0 O S 29 R I Sulfinic acid iodide SII TF000SII 0 O P R OH 30 OH Phosphonic acid PPA TF000PPA 0 O O P OH 31 R OH Phosphoric acid PP2 TF000PP2 0 O O P O- 32 R O- Phosphate PP3 TF000PP3 0 Terminal Fragments.

207 C.2 Two Point Linker Fragments

Table C-2. Two Point Linker Fragments.

Two point linker fragments have two connection points Structure Name 3L 8L Chg 1 R1 CH2 R2 Ethyl LYH 2PL00LYH 0

H R2

2 R1 H Trans-alkene AK2 2PL00AK2 0 H H

3 R1 R2 Cis-Alkene CIS 2PL00CIS 0 H H

4 R1 R2 Geminal-Alkene AK3 2PL00AK3 0 5 R1 R2 Alkyne AKY 2PL00AKY 0 H N 1 R1 R2 Sec-amine SAM 2PL00SAM 0 R N 2

2 R1 H Sec-aldimine SAD 2PL00SAD 0 H N

3 R1 R2 Prim-ketimine PKT 2PL00PKT 0 N R1 R2 4 N Azo AZO 2PL00AZO 0 5 R1 N C N R2 Carbo-diimide DII 2PL00DII 0 R N 2

6 R1 F Imidoyl fluoride IMF 2PL00IMF 0 R N 2

7 R1 Cl Imidoyl chloride IMC 2PL00IMC 0 R N 2

8 R1 Br Imidoyl bromide IMB 2PL00IMB 0 R N 2

9 R1 I Imidoyl iodide III 2PL00III 0 O

1 R1 R2 Ketone KET 2PL00KET 0 O 2 R1 R2 Ethoxy ETO 2PL00ETO 0 Continued on next page.

208 Table C-2. (continued)

Structure Name 3L 8L Chg O C

3 R1 R2 Ketene KTE 2PL00KTE 0

R2 4 R1 OH Sec-alcohol SAL 2PL00SAL 0 HO OH

5 R1 R2 Enediol END 2PL00END 0 O 6 R1 R2 Ether ETR 2PL00ETR 0

O R2 7 R1 O Peroxide PXD 2PL00PXD 0 O

R2 8 R1 O Carboxylic acid ester CAE 2PL00CAE 0 O O

9 R1 O R2 Anhydride ANH 2PL00ANH 0 S 1 R1 R2 Thio ether STR 2PL00STR 0

S R2 2 R1 S Disulfide DIS 2PL00DIS 0 S

3 R1 R2 Thio ketone TKT 2PL00TKT 0 O

R2 R1 N 1 H Sec-amide AMI 2PL00AMI 0 OH N

2 R1 R2 Oxime OXM 2PL00OXM 0 O R 1 OH NH 3 R2 Alpha amino acid AAA 2PL00AAA 0 O R 1 N OH 4 R2 Carbamic acid CBA 2PL00CBA 0 O R 1 N F 5 R2 Carbamic acid floride CBF 2PL00CBF 0 O R 1 N Cl 6 R2 Carbamic acid chloride CBC 2PL00CBC 0 Continued on next page.

209 Table C-2. (continued)

Structure Name 3L 8L Chg O R 1 N Br 7 R2 Carbamic acid bromide CBB 2PL00CBB 0 O R 1 N I 8 R2 Carbamic acid iodide CBI 2PL00CBI 0 O

R1 S R2 1 O Sulfone SO2 2PL00SO2 0 O S 2 R1 R2 Sulfoxide SLF 2PL00SLF 0 O S 3 R1 OR2 Sulfinic acid ester SLE 2PL00SLE 0 O

R1 O S O R2 4 O Sulfuric acid ester SFE 2PL00SFE 0 S R2 5 R1 O Thio carboxylic acid ester TCE 2PL00TCE 0 S R2 6 R1 S Dithio carboxylic acid ester DCE 2PL00DCE 0 S R 1 N F 7 R2 Thio carbamic acid fluoride TBF 2PL00TBF 0 S R 1 N Cl 8 R2 Thio carbamic acid chloride TBC 2PL00TBC 0 S R 1 N Br 9 R2 Thio carbamic acid bromide TBB 2PL00TBB 0 S R 1 N I 10 R2 Thio carbamic acid iodide TBI 2PL00TBI 0 S R 1 N OH 11 R2 Thio carbamic acid TBA 2PL00TBA 0

O R2 HO S N 12 O R1 Sulfuric acid amide SAA 2PL00SAA 0 Continued on next page.

210 Table C-2. (continued)

Structure Name 3L 8L Chg O

R1 S OR2 13 O Sulfonic acid ester SAE 2PL00SAE 0 S 14 R1 OR2 sulfenic acid ester SEE 2PL00SEE 0 O R2 P R O 15 1 O- Phosphonic acid ester PP4 TF000PP4 0 Two Point Linker Fragments.

C.3 Three Point Linker Fragments

Table C-3. Three Point Linker Fragments. Three point linker fragments have three connection points.

No.  Name  3L  8L  Chg
1  Propyl  ROP  3PL00ROP  0
2  Alkene  AK4  3PL00AK4  0

1  Tertiary amine  TTA  3PL00TTA  0
2  Tertiary amine Et  TT1  3PL00TT1  0
3  Secondary ketimine  SKT  3PL00SKT  0

1  Tertiary alcohol  TAL  3PL00TAL  0
2  Enol  ENL  3PL00ENL  0
3  Hemiacetal  HEM  3PL00HEM  0

1  Oxime ether  OXE  3PL00OXE  0
2  N-oxide  NOX  3PL00NOX  0
3  Hydroxylamine  HXA  3PL00HXA  0
4  Tertiary amide  TAM  3PL00TAM  0
5  Imido ester  IDE  3PL00IDE  0
6  Imide  IME  3PL00IME  0
7  Carbamic acid ester  CAD  3PL00CAD  0
8  Alpha amino acid  ADB  3PL00ADB  0

1  Thiotertiary amide  TIA  3PL00TIA  0
2  Imidothioester  ITE  3PL00ITE  0
3  Thiocarbamic acid ester  ARB  3PL00ARB  0
4  Sulfuric acid amide ester  AAE  3PL00AAE  0
5  Sulfonamide  SFA  3PL00SFA  0
6  Sulfenic acid amide  SLA  3PL00SLA  0
7  Sulfinic acid amide  SIM  3PL00SIM  0

1  Phosphine  PHI  3PL00PHI  0
2  Phosphine oxide  PHO  3PL00PHO  0

C.4 Four Point Linker Fragments

Table C-4. Four Point Linker Fragments. Four point linker fragments have four connection points.

No.  Name  3L  8L  Chg
1  Butyl  BUT  4PL00BUT  0
2  Alkene  AK5  4PL00AK5  0

1  Hydrazone  HZO  4PL00HZO  0
2  Hydrazine  HZI  4PL00HZI  0
3  Amidine  AME  4PL00AME  0
4  Quaternary ammonium  QTA  4PL00QTA  +1

1  Enol ether  ELE  4PL00ELE  0
2  1,2-diol  12D  4PL0012D  0
3  Acetal  ACE  4PL00ACE  0

1  1,2-aminoalcohol  12A  4PL0012A  0
2  Carboxylic acid hydrazide  CAH  4PL00CAH  0
3  Urea  URE  4PL00URE  0
4  Isourea  IUR  4PL00IUR  0

1  Thio acetal  CET  4PL00CET  0
2  Thiourea  TIU  4PL00TIU  0
3  Isothiourea  ITU  4PL00ITU  0
4  Sulfuric acid diamide  DIA  4PL00DIA  0

C.5 Five Point Linker Fragments

Table C-5. Five Point Linker Fragments. Five point linker fragments have five connection points.

No.  Name  3L  8L  Chg
1  Guanidine  GUD  5PL00GUD  0
2  Amidrazone  ADZ  5PL00ADZ  0
3  Enamine  ENM  5PL00ENM  0
4  Hemiaminal  HMI  5PL00HMI  0
5  Thiohemiaminal  THI  5PL00THI  0
6  Semicarbazone  SCZ  5PL00SCZ  0
7  Thiosemicarbazone  TSZ  5PL00TSZ  0
8  Semicarbazide  SCI  5PL00SCI  0
9  Thiosemicarbazide  TSI  5PL00TSI  0

C.6 Three Membered Ring Fragments

Table C-6. Three Membered Ring Fragments.

No.  Name  3L  8L  Chg
1  Cyclopropyl  CPP  3MR00CPP  0
2  1,2,2,3,3-aziridine  AZI  3MR00AZI  0
3  2,2,3,3-epoxide  EPO  3MR00EPO  0
4  2,2,3,3-thiirane  TII  3MR00TII  0

C.7 Four Membered Ring Fragments

Table C-7. Four Membered Ring Fragments.

No.  Name  3L  8L  Chg
1  Cyclobutane  CBT  4MR00CBT  0
2  1,1-cyclobutane  1BT  4MR001BT  0
3  1,1,2,2,3,3,4,4-cyclobutane  2BT  4MR002BT  0
4  2,2,3,3,4,4-cyclobutane-1-one  4BT  4MR004BT  0

1  2,2,3,3,4,4-azetidine  4BX  4MR004BX  0
2  3,3,4,4-beta-lactone  4BO  4MR004BO  0
3  3,3,4,4-beta-lactam  4BA  4MR004BA  0
4  3,3,4,4-beta-lactim  4BI  4MR004BI  0

1  2,2,3,3,4,4-oxetane  OTE  4MR00OTE  0

C.8 Five Membered Ring Fragments

Table C-8. Five Membered Ring Fragments.

No.  Name  3L  8L  Chg
1  1,2-cyclopentadiene  PT1  5MR00PT1  0
2  1,3-cyclopentadiene  PT2  5MR00PT2  0
3  1,4-cyclopentadiene  PT3  5MR00PT3  0
4  2,3-cyclopentadiene  PT4  5MR00PT4  0
5  2-cyclopentan-1-one  PT5  5MR00PT5  0

1  Cyclopentyl  CPL  5MR00CPL  0

1  2,3-pyrrole  YR1  5MR00YR1  0
2  1,3,4-pyrrole  YR2  5MR00YR2  0
3  1,2,3,4,5-pyrrole  YR3  5MR00YR3  0
4  2,3,4,5-pyrrole  YR4  5MR00YR4  0

1  1,3,5-pyrazole  PRZ  5MR00PRZ  0

1  1-pyrazoline  RA1  5MR00RA1  0
2  3-pyrazoline  RA2  5MR00RA2  0
3  1,3-pyrazoline  RA3  5MR00RA3  0

1  1-pyrazolidine  RI1  5MR00RI1  0
2  1,4-pyrazolidine-3,5-dione  RI2  5MR00RI2  0

1  1,2,4-imidazole  IDZ  5MR00IDZ  0

1  3-imidazoline  IZ1  5MR00IZ1  0

1  1-imidazolidine  IM1  5MR00IM1  0
2  1,3-imidazolidine  IM2  5MR00IM2  0
3  1-imidazolidinone  IM3  5MR00IM3  0
4  1,3-imidazolidinone  IM4  5MR00IM4  0
5  1,3-imidazoline-2,4-dione  IM5  5MR00IM5  0

1  1,4,5-1,2,3-triazole  AZ1  5MR00AZ1  0
2  1,3,5-pyrrodiazole  PRR  5MR00PRR  0

1  1H-tetrazole  TZ1  5MR00TZ1  0
2  Tetrazole  TZ2  5MR00TZ2  0

1  Pentazole  PZ1  5MR00PZ1  0

1  1-pyrrolidine  RD1  5MR00RD1  0
2  1-pyrrolidone  RD2  5MR00RD2  0
3  1-pyrrolidine-2,5-dione  RD3  5MR00RD3  0

1  2-furan  FN1  5MR00FN1  0
2  3-furan  FN2  5MR00FN2  0
3  2,3-furan  FN3  5MR00FN3  0
4  3,4-furan  FN4  5MR00FN4  0
5  5-R-gamma-lactone  FN5  5MR00FN5  0

1  4,4,5,5-1,3-dioxolan-2-one  XL1  5MR00XL1  0

1  2,4,5-oxazole  OZO  5MR00OZO  0

1  3,5-isoxazole  IOZ  5MR00IOZ  0

1  2-oxazoline  ZO1  5MR00ZO1  0
2  2-1,3-oxazol-4-one  ZO2  5MR00ZO2  0

1  3,5-oxazolidinone  AO1  5MR00AO1  0
2  3,4,4,5,5-oxazolidinone  AO2  5MR00AO2  0

1  2,5-1,3,4-oxadiazole  DZ1  5MR00DZ1  0
2  3,4-1,2,5-oxadiazole  DZ2  5MR00DZ2  0

1  3-thiophene  3TP  5MR003TP  0
2  2,3-thiophene  23T  5MR0023T  0
3  3,4-thiophene  34T  5MR0034T  0
4  2,5-thiophene  25T  5MR0025T  0

1  4,5-thiazole  TZL  5MR00TZL  0
2  2,4,5-thiazole  TIZ  5MR00TIZ  0

1  2-thiazoline  ZL1  5MR00ZL1  0

1  2,4-1,3-thiazolidine  IL1  5MR00IL1  0
2  2-1,3-thiazolidine  IL2  5MR00IL2  0
3  2,3,4-1,3-thiazolidine  IL3  5MR00IL3  0
4  5-thiazolidinedione  TLD  5MR00TLD  0

1  4,5-1,2,3-thiadiazole  DI1  5MR00DI1  0
2  5-1,2,3-thiadiazole  DI2  5MR00DI2  0
3  3,4-1,2,5-thiadiazole  DI3  5MR00DI3  0
4  2,5-1,3,4-thiadiazole  DI4  5MR00DI4  0

C.9 Six Membered Ring Fragments

Table C-9. Six Membered Ring Fragments.

No.  Name  3L  8L  Chg
1  Phenyl  BNZ  6MR00BNZ  0
2  1,2-phenyl (ortho)  OSB  6MR00OSB  0
3  1,3-phenyl (meta)  MSB  6MR00MSB  0
4  1,4-phenyl (para)  PSB  6MR00PSB  0
5  1,2,3,4,5-phenyl  4BZ  6MR004BZ  0

1  Cyclohexyl  6CH  6MR006CH  0
2  1,1-cyclohexane  11C  6MR0011C  0
3  1,2-cyclohexene  12C  6MR0012C  0

1  2-pyridine  2PY  6MR002PY  0
2  3-pyridine  3PY  6MR003PY  0
3  4-pyridine  4PY  6MR004PY  0
4  2,5-pyridine  PY2  6MR00PY2  0
5  3,4-pyridine  PY3  6MR00PY3  0

1  1-piperidine  1PP  6MR001PP  0

1  2-pyrazine  ZI1  6MR00ZI1  0
2  2,3,5,6-pyrazine  ZI2  6MR00ZI2  0

1  1-piperazine  PPZ  6MR00PPZ  0
2  1,4-piperazine  PZ2  6MR00PZ2  0

1  1-pyrimidine  MI1  6MR00MI1  0
2  1,5-pyrimidine-2,4-dione  MI2  6MR00MI2  0

1  3,4-pyridazine  ZA1  6MR00ZA1  0
2  3,6-pyridazine  ZA2  6MR00ZA2  0
3  3,4,5,6-pyridazine  ZA3  6MR00ZA3  0
4  2,6-pyridazin-3-one  ZA4  6MR00ZA4  0

1  2,4-1,3,5-triazine  A11  6MR00A11  0
2  2,4,6-1,3,5-triazine  A12  6MR00A12  0

1  5,5-barbituric acid  B11  6MR00B11  0
2  1,5,5-barbituric acid  B12  6MR00B12  0

1  2-phenol  2PO  6MR002PO  0
2  3,4-diphenol  3PO  6MR003PO  0
3  2-diphenylether  DPE  6MR00DPE  0
4  1,4-benzoate  14B  6MR0014B  0
5  1,4-benzoate ester  BE1  6MR00BE1  0
6  2,3,4,5,6-benzoate  1B4  6MR001B4  0
7  2,3,4,5,6-benzoate ester  BE2  6MR00BE2  0

1  2-4H-pyran  C11  6MR00C11  0
2  2,3,4,4,5,6-4H-pyran  C12  6MR00C12  0

1  4-oxane  D11  6MR00D11  0
2  2,2,3,3,4,4,5,5,6,6-oxane  D12  6MR00D12  0
3  alpha-D-glucose  ADG  6MR00ADG  0
4  alpha-*D-glucose  ASD  6MR00ASD  0
5  2-deoxy-alpha-*D-glucose  2DA  6MR002DA  0

1  Morpholine  MOR  6MR00MOR  0
2  Thiomorpholine  MR1  6MR00MR1  0

1  6-R-delta-lactone  E11  6MR00E11  0
2  6-R-delta-lactam  E12  6MR00E12  0
3  2-pyridone  2YO  6MR002YO  0
4  2-thiopyridone  2TP  6MR002TP  0
5  2-iminopyridine  2IP  6MR002IP  0

C.10 Greater than Six Membered Ring Fragments

Table C-10. Greater than Six Membered Ring Fragments.

No.  Name  3L  8L  Chg
1  Cycloheptyl  G61  GMR00G61  0
2  Cyclooctyl  G62  GMR00G62  0

C.11 Fused Ring Fragments

Table C-11. Fused Ring Fragments.

No.  Name  3L  8L  Chg
1  1-naphthalene  NAP  FR000NAP  0

1  1-indan  ID1  FR000ID1  0
2  1-inden-1-yl  ID2  FR000ID2  0

1  2,4,5,7-quinoline  QE1  FR000QE1  0
2  2,4,6-quinoline  QE2  FR000QE2  0
3  2,3,4,5,6,7,8-quinoline  QE3  FR000QE3  0

1  1,3-isoquinoline  IQ1  FR000IQ1  0
2  1,3,4,5,6,7,8-isoquinoline  IQ2  FR000IQ2  0

1  3-phthalazine  HT1  FR000HT1  0
2  3,4,5,6,7,8-phthalazine  HT2  FR000HT2  0
3  2,4-phthalazinone  HT3  FR000HT4  0

1  3,4-cinnoline  CI1  FR000CI1  0
2  3,4,5,6,7,8-cinnoline  CI2  FR000CI2  0

1  4,6,7-quinazoline  IN1  FR000IN1  0
2  2,4,5,6,7,8-quinazoline  IN2  FR000IN2  0

1  2,3,5,6,7,8-quinoxaline  NO1  FR000NO1  0

1  2,4,6,7-pteridine  ER1  FR000ER1  0
2  6,7-pterin  ER2  FR000ER2  0
3  2,4-pteridine  ER3  FR000ER1  0
4  2,4,6-pteridine  ER4  FR000ER1  0
5  6-pterin  ER5  FR000ER1  0

1  3-indole  LE1  FR000LE1  0
2  4-indole  LE2  FR000LE2  0
3  3,4-indole  LE3  FR000LE3  0
4  3,5-indole  LE4  FR000LE4  0
5  1,2,3,5-indole  LE5  FR000LE5  0

1  1,3,4,5,6,7-isoindole  LE6  FR000LE6  0

1  2,6,8-purine  PU1  FR000PU1  0

1  3,4,5,6,7-indazole  Z11  FR000Z11  0

1  1-benzotriazole  Y11  FR000Y11  0

1  1-1,3-benzodiazepine  N11  FR000N11  0
2  1-1,4-benzodiazepine  N12  FR000N12  0
3  4-1,5-benzodiazepin-2-one  N13  FR000N13  0

1  3,4,7-coumarin  CU1  FR000CU1  0
2  3,4,5,6,7,8-coumarin  CU2  FR000CU2  0

1  2,2,6,7,8,9-chroman  CHR  FR000CHR  0

1  2,3-chromone  CE1  FR000CE1  0
2  6-chromone  CE2  FR000CE2  0
3  3,5,7,10,11-flavone  CE3  FR000CE3  0

1  5-1,4-benzodioxane  BD1  FR000BD1  0

1  2-benzofuran  BF1  FR000BF1  0
2  3-benzofuran  BF2  FR000BF2  0
3  5-benzofuran  BF3  FR000BF3  0

1  4-phthalide  PH1  FR000PH1  0

1  4-1,3-benzodioxole  BZ1  FR000BZ1  0

1  2,3,4,5,6,7-benzothiophene  BN1  FR000BN1  0
2  3,6-benzothiophene  BN2  FR000BN2  0
3  2,3,6-benzothiophene  3BT  FR0003BT  0
4  1,3,4,5,6,7-isobenzothiophene  BN4  FR000BN4  0

1  [name missing]  BXZ  FR000BXZ  0

1  3,4,5,6,7-benzisoxazole  BI1  FR000BI1  0
2  3,5-benzisoxazole  BI2  FR000BI2  0

1  2,4,5,6,7-benzothiazole  BH1  FR000BH1  0
2  2,6-benzothiazole  BH2  FR000BH2  0
3  2-benzothiazole  BH3  FR000BH3  0

1  3-1,4-benzoxazine  BX1  FR000BX1  0
2  6-1,4-benzoxazin-3(4H)-one  BX2  FR000BX2  0

1  1-fluorenone  1FO  FR0001FO  0
2  1-dibenzofuran  X11  FR000X11  0
3  carbazol-9-yl  X12  FR000X12  0

1  1,10-anthracene  AN1  FR000AN1  0
2  1-dioxoanthracene  IDA  FR000IDA  0
3  3,6,9-xanthen-9-yl  XA1  FR000XA1  0
4  1-oxanthrene  XO1  FR000XO1  0

1  1-acridine  CR1  FR000CR1  0
2  1,2,3,4,5,6,7,8,9-acridine  CR2  FR000CR2  0

1  2-phenazine  EZ1  FR000EZ1  0
2  1,2,3,4,5,6,7,8-phenazine  EZ2  FR000EZ2  0
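The 8L codes listed in Tables C-2 through C-11 appear to be built from a fragment-class prefix (for example 2PL, 3MR, or FR), zero padding, and the 3L code. The Python sketch below assumes that pattern, which is inferred from the table entries and is not stated in the text. The function names split_8l and is_consistent are hypothetical and are not part of any toolkit API.

```python
def split_8l(code: str) -> tuple[str, str]:
    """Split an 8L code into (class prefix, 3-character suffix).

    Assumption (inferred from the tables): the last three characters
    repeat the 3L code and the class prefix is padded with zeros.
    """
    if len(code) != 8:
        raise ValueError(f"expected 8 characters, got {len(code)}: {code!r}")
    head, suffix = code[:-3], code[-3:]
    # Strip the zero padding that separates the prefix from the suffix.
    return head.rstrip("0"), suffix


def is_consistent(three_letter: str, eight_letter: str) -> bool:
    """Return True when the 8L code ends with the given 3L code."""
    return split_8l(eight_letter)[1] == three_letter


if __name__ == "__main__":
    # Entries copied from the tables above.
    for three, eight in [("KET", "2PL00KET"), ("NAP", "FR000NAP"),
                         ("HT3", "FR000HT4"), ("ER3", "FR000ER1")]:
        print(three, eight, split_8l(eight), is_consistent(three, eight))
```

Under this assumption, a check of this kind flags entries such as HT3 (FR000HT4) and ER3 (FR000ER1), where the 8L suffix differs from the 3L code.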

REFERENCES

[1] T. M. Speight and N. H. G. Holford, editors. Avery’s Drug Treatment. Adis Press, Auckland, New Zealand, 4th edition, 1997. [2] N. A. Roberts, J. A. Martin, D. Kinchington, A. V. Broadhurst, J. C. Craig, I. B. Duncan, S. A. Galpin, B. K. Handa, J. Kay, A. Krohn, R. W. Lambert, J. H. Merrett, J. S. Mills, K. E. B. Parkes, S. Redshaw, A. J. Ritchie, D. L. Taylor, G. J. Thomas, and P. J. Machin. Rational Design of Peptide-Based HIV Proteinase-Inhibitors. Science, 248(4953):358–361, 1990. [3] D. W. Cushman, M. A. Ondetti, E. M. Gordon, S. Natarajan, D. S. Karanewsky, J. Krapcho, and E. W. Petrillo Jr. Rational design and biochemical utility of specific inhibitors of angiotensin-converting enzyme. J. Cardiovasc. Pharmacol., 10:S17–30, 1987.

[4] A. R. Leach, B. K. Shoichet, and C. E. Peishoff. Prediction of protein-ligand interactions. docking and scoring: Successes and gaps. J. Med. Chem., 49(20):5851–5855, 2006.

[5] P. Ferrara, H. Gohlke, D. J. Price, G. Klebe, and C. L. Brooks. Assessing scoring functions for protein-ligand interactions. J. Med. Chem., 47(12):3032–3047, 2004. [6] M. B. Peters, K. Raha, and K. M. Merz Jr. Quantum mechanics in structure-based drug design. Curr. Opin. Drug Discovery Dev., 9(3):370–379, 2006. [7] K. Raha, A. J. van der Vaart, K. E. Riley, M. B. Peters, L. M. Westerhoff, H. Kim, and K. M. Merz Jr. Pairwise decomposition of residue interaction energies using semiempirical quantum mechanical methods in studies of protein-ligand interaction. J. Am. Chem. Soc., 127(18):6583–6594, 2005. [8] A. R. Ortiz, M. T. Pisabarro, F. Gago, and R. C. Wade. Prediction of Drug-Binding Affinities by Comparative Binding-Energy Analysis. J. Med. Chem., 38(14):2681–2691, 1995. [9] M. B. Peters and K. M. Merz Jr. Semiempirical comparative binding energy analysis (SE-COMBINE) of a series of trypsin inhibitors. J. Chem. Theory Comput., 2(2):383–399, 2006. [10] R. D. Cramer III, D. E. Patterson, and J. D. Bunce. Comparative Molecular Field Analysis (CoMFA). 1. Effect of Shape on Binding of Steroids to Carrier Proteins. J. Am. Chem. Soc., 110:5959–5967, 1988. [11] G. Klebe, U. Abraham, and T. Mietzner. Molecular Similarity Indexes in a Comparative-Analysis (CoMSIA) of Drug Molecules to Correlate and Predict their biological-activity. J. Med. Chem., 37(24):4130–4146, 1994. [12] G. Klebe. Comparative molecular similarity indices analysis: CoMSIA. Perspect. Drug Discovery Des., 12:87–104, 1998.

[13] F. Estienne, Y. Vander Heyden, and D. L. Massart. Chemometrics and modeling. Chimia, 55(1-2):70–80, 2001. [14] S. Wold, M. Sjostrom, and L. Eriksson. PLS-regression: a basic tool of chemometrics. Chemom. Intell. Lab. Syst., 58(2):109–130, 2001. [15] F. C. Bernstein, T. F. Koetzle, G. J. B. Williams, E. F. Meyer, M. D. Brice, J. R. Rodgers, O. Kennard, T. Shimanouchi, and M. Tasumi. Protein Data Bank - Computer-Based Archival File for Macromolecular Structures. J. Mol. Biol., 112(3):535–542, 1977. [16] D. A. Case, T. A. Darden, T. E. Cheatham, III, C. L. Simmerling, J. Wang, R. E. Duke, R. Luo, K. M. Merz Jr., D. A. Pearlman, M. M. Crowley, R. C. Walker, W. W. Zhang, B. Wang, S. Hayik, A. Roitberg, G. Seabra, K. F. Wong, F. Paesani, X. Wu, S. Brozell, V. Tsui, H. Gohlke, L. Yang, C. Tan, J. Mongan, V. Hornak, G. Cui, P. Beroza, D. H. Mathews, C. Schafmeister, W. S. Ross, and P. A. Kollman. AMBER 9, 2006. [17] T. Fink, H. Bruggesser, and J. L. Reymond. Virtual exploration of the small-molecule chemical universe below 160 daltons. Angew. Chem., Int. Ed., 44(10):1504–1508, 2005. [18] M. A. Koch, A. Schuffenhauer, M. Scheck, S. Wetzel, M. Casaulta, A. Odermatt, P. Ertl, and H. Waldmann. Charting biologically relevant chemical space: A structural classification of natural products (SCONP). Proc. Natl. Acad. Sci. U. S. A., 102(48):17272–17277, 2005. [19] D. G. Lloyd, G. Golfis, A. J. S. Knox, D. Fayne, M. J. Meegan, and T. I. Oprea. Oncology exploration: charting cancer medicinal chemistry space. Drug Discov. Today, 11(3-4):149–159, 2006. [20] T. Fink and J. L. Reymond. Virtual exploration of the chemical universe up to 11 atoms of C, N, O, F: Assembly of 26.4 million structures (110.9 million stereoisomers) and analysis for new ring systems, stereochemistry, physicochemical properties, compound classes, and drug discovery. J. Chem. Inf. Model., 47(2):342–353, 2007.

[21] A. Schuffenhauer, P. Ertl, S. Roggo, S. Wetzel, M. A. Koch, and H. Waldmann. The scaffold tree - visualization of the scaffold universe by hierarchical scaffold classification. J. Chem. Inf. Model., 47(1):47–58, 2007. [22] R. E. Babine and S. L. Bender. Molecular recognition of protein-ligand complexes: Applications to drug design. Chem. Rev., 97(5):1359–1472, 1997. [23] J. G. Robertson. Mechanistic basis of enzyme-targeted drugs. Biochemistry, 44(15):5561–5571, 2005.

[24] R. B. Silverman. The organic chemistry of enzyme-catalyzed reactions. Academic Press, San Diego, 2002. [25] P. A. Boriack-Sjodin, S. Zeitlin, H. H. Chen, L. Crenshaw, S. Gross, A. Dantanarayana, P. Delgado, J. A. May, T. Dean, and D. W. Christianson. Structural analysis of inhibitor binding to human carbonic anhydrase II. Protein Sci., 7(12):2483–9, 1998.

[26] Ajay and M. A. Murcko. Computational methods to predict binding free energy in ligand-receptor complexes. J. Med. Chem., 38(26):4953–4967, 1995. [27] J. J. Irwin and B. K. Shoichet. ZINC–a free database of commercially available compounds for virtual screening. J. Chem. Inf. Model., 45(1):177–82, 2005. [28] J. J. Irwin, F. M. Raushel, and B. K. Shoichet. Virtual screening against metalloenzymes for inhibitors and substrates. Biochemistry, 44(37):12316–28, 2005. [29] R. E. Dolle, B. Le Bourdonnec, G. A. Morales, K. J. Moriarty, and J. M. Salvino. Comprehensive survey of combinatorial library synthesis: 2005. J. Comb. Chem., 8(5):597–635, 2006. [30] A. J. S. Knox, M. J. Meegan, G. Carta, and D. G. Lloyd. Considerations in compound database preparation-”hidden” impact on virtual screening results. J. Chem. Inf. Model., 45(6):1908–1919, 2005. [31] A. J. S. Knox, M. J. Meegan, and D. G. Lloyd. Estrogen receptors: Molecular interactions, virtual screening and future prospects. Curr. Top. Med. Chem., 6(3):217–243, 2006. [32] J. C. Baber, A. S. William, Y. H. Gao, and M. Feher. The use of consensus scoring in ligand-based virtual screening. J. Chem. Inf. Model., 46(1):277–288, 2006. [33] W. P. Walters and M. A. Murcko. Prediction of ’drug-likeness’. Adv. Drug Delivery Rev., 54(3):255–271, 2002. [34] M. Feher and J. M. Schmidt. Property distributions: Differences between drugs, natural products, and molecules from combinatorial chemistry. J. Chem. Inf. Comput. Sci., 43(1):218–227, 2003. [35] M. C. Hutter. Separating drugs from nondrugs: A statistical approach using atom pair distributions. J. Chem. Inf. Model., 47(1):186–194, 2007.

[36] M. Snarey, N. K. Terrett, P. Willett, and D. J. Wilton. Comparison of algorithms for dissimilarity-based compound selection. J. Mol. Graphics Modell., 15(6):372–385, 1997.

[37] S. L. Dixon and K. M. Merz Jr. One-dimensional molecular representations and similarity calculations: Methodology and validation. J. Med. Chem., 44(23):3795–3809, 2001.

[38] T. Ewing, J. C. Baber, and M. Feher. Novel 2D fingerprints for ligand-based virtual screening. J. Chem. Inf. Model., 46(6):2423–2431, 2006. [39] M. Stahl and H. Mauser. Database clustering with a combination of fingerprint and maximum common substructure methods. J. Chem. Inf. Model., 45(3):542–548, 2005. [40] P. Willett. Searching techniques for databases of two- and three-dimensional chemical structures. J. Med. Chem., 48(13):4183–4199, 2005. [41] I. Muegge, S. L. Heald, and D. Brittelli. Simple selection criteria for drug-like chemical matter. J. Med. Chem., 44(12):1841–1846, 2001. [42] C. A. Lipinski, F. Lombardo, B. W. Dominy, and P. J. Feeney. Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Adv. Drug Delivery Rev., 23(1-3):3–25, 1997. [43] M. Hornig and A. Klamt. COSMOfrag: A novel tool for high-throughput ADME property prediction and similarity screening based on quantum chemistry. J. Chem. Inf. Model., 45(5):1169–1177, 2005. [44] A. L. Cheng and K. M. Merz Jr. Prediction of aqueous solubility of a diverse set of compounds using quantitative structure-property relationships. J. Med. Chem., 46(17):3572–3580, 2003. [45] A. Cheng and S. L. Dixon. In silico models for the prediction of dose-dependent human hepatotoxicity. J. Comput. Aided Mol. Des., 17(12):811–23, 2003. [46] R. G. Susnow and S. L. Dixon. Use of robust classification techniques for the prediction of human cytochrome P450 2D6 inhibition. J. Chem. Inf. Comput. Sci., 43(4):1308–1315, 2003. [47] W. J. Egan, K. M. Merz Jr., and J. J. Baldwin. Prediction of drug absorption using multivariate statistics. J. Med. Chem., 43(21):3867–3877, 2000.

[48] J. C. Baber and M. Feher. Predicting synthetic accessibility: Application in drug discovery and development. Mini-Rev. Med. Chem., 4(6):681–692, 2004. [49] M. Stahl, N. P. Todorov, T. James, H. Mauser, H. J. Boehm, and P. M. Dean. A validation study on the practical use of automated de novo design. J. Comput.-Aided Mol. Des., 16(7):459–478, 2002. [50] P. M. Dean, D. G. Lloyd, and N. P. Todorov. De novo drug design: Integration of structure-based and ligand-based methods. Curr. Opin. Drug Discovery Dev., 7(3):347–353, 2004.

[51] H. Mauser and M. Stahl. Chemical fragment spaces for de novo design. J. Chem. Inf. Model., 47(2):318–324, 2007. [52] D. G. Lloyd, C. L. Buenemann, N. P. Todorov, D. T. Manallack, and P. M. Dean. Scaffold hopping in de novo design. ligand generation in the absence of receptor information. J. Med. Chem., 47(3):493–496, 2004. [53] H. Gohlke, M. Hendlich, and G. Klebe. Predicting binding modes, binding affinities and ’hot spots’ for protein-ligand complexes using a knowledge-based scoring function. Perspect. Drug Discovery Des., 20(1):115–144, 2000. [54] B. A. Grzybowski, A. V. Ishchenko, J. Shimada, and E. I. Shakhnovich. From knowledge-based potentials to combinatorial lead design in silico. Acc. Chem. Res., 35(5):261–269, 2002. [55] B. A. Grzybowski, A. V. Ishchenko, C. Y. Kim, G. Topalov, R. Chapman, D. W. Christianson, G. M. Whitesides, and E. I. Shakhnovich. Combinatorial computational method gives new picomolar ligands for a known enzyme. Proc. Natl. Acad. Sci. U. S. A., 99(3):1270–1273, 2002.

[56] M. Feher, E. Deretey, and S. Roy. BHB: A simple knowledge-based scoring function to improve the efficiency of database screening. J. Chem. Inf. Comput. Sci., 43(4):1316–1327, 2003.

[57] H. F. G. Velec, H. Gohlke, and G. Klebe. DrugScore(CSD)-knowledge-based scoring function derived from small molecule crystal data with superior recognition rate of near-native ligand poses and better affinity prediction. J. Med. Chem., 48(20):6296–6303, 2005. [58] W. D. Cornell, P. Cieplak, C. I. Bayly, I. R. Gould, K. M. Merz Jr., D. M. Ferguson, D. C. Spellmeyer, T. Fox, J. W. Caldwell, and P. A. Kollman. A 2nd Generation Force-Field for the Simulation of Proteins, Nucleic-Acids, and Organic-Molecules. J. Am. Chem. Soc., 117(19):5179–5197, 1995. [59] W. D. Cornell, P. Cieplak, C. I. Bayly, I. R. Gould, K. M. Merz Jr., D. M. Ferguson, D. C. Spellmeyer, T. Fox, J. W. Caldwell, and P. A. Kollman. A second generation force field for the simulation of proteins, nucleic acids, and organic molecules. J. Am. Chem. Soc., 118(9):2309–2309, 1996. [60] D. A. Case, T. E. Cheatham, T. Darden, H. Gohlke, R. Luo, K. M. Merz Jr., A. Onufriev, C. Simmerling, B. Wang, and R. J. Woods. The AMBER biomolecular simulation programs. J. Comput. Chem., 26(16):1668–1688, 2005.

[61] B. R. Brooks, R. E. Bruccoleri, B. D. Olafson, D. J. States, S. Swaminathan, and M. Karplus. CHARMM - a Program for Macromolecular Energy, Minimization, and Dynamics Calculations. J. Comput. Chem., 4(2):187–217, 1983.

[62] T. A. Halgren. Merck molecular force field.5. Extension of MMFF94 using experimental data, additional computational data, and empirical rules. J. Comput. Chem., 17(5-6):616–641, 1996.

[63] T. A. Halgren. Merck molecular force field.3. Molecular geometries and vibrational frequencies for MMFF94. J. Comput. Chem., 17(5-6):553–586, 1996. [64] T. A. Halgren. Merck molecular force field.2. MMFF94 van der Waals and electrostatic parameters for intermolecular interactions. J. Comput. Chem., 17(5-6):520–552, 1996. [65] T. A. Halgren. Merck molecular force field.1. Basis, form, scope, parameterization, and performance of MMFF94. J. Comput. Chem., 17(5-6):490–519, 1996. [66] T. A. Halgren. Representation of van der Waals (vdW) Interactions in Molecular Mechanics Force-Fields - Potential Form, Combination Rules, and vdW parameters. J. Am. Chem. Soc., 114(20):7827–7843, 1992. [67] T. A. Halgren and R. B. Nachbar. Merck molecular force field.4. Conformational energies and geometries for MMFF94. J. Comput. Chem., 17(5-6):587–615, 1996. [68] T. A. Halgren. MMFF VII. Characterization of MMFF94, MMFF94s, and other widely available force fields for conformational energies and for intermolecular-interaction energies and geometries. J. Comput. Chem., 20(7):730–748, 1999. [69] T. A. Halgren. MMFF VI. MMFF94S Option for Energy Minimization Studies. J. Comput. Chem., 20(7):720–729, 1999. [70] W. L. Jorgensen and J. Tirado-Rives. The OPLS Potential Functions for Proteins - Energy Minimizations for Crystals of Cyclic-Peptides and Crambin. J. Am. Chem. Soc., 110(6):1657–1666, 1988. [71] N. L. Allinger, Y. H. Yuh, and J. H. Lii. Molecular Mechanics - the MM3 force-field for Hydrocarbons.1. J. Am. Chem. Soc., 111(23):8551–8566, 1989. [72] M. J. S. Dewar and W. Thiel. Ground States of Molecules. 38. The MNDO Method. Approximations and Parameters. J. Am. Chem. Soc., 99(15):4899–4907, 1977.

[73] M. J. S. Dewar, E. G. Zoebisch, E. F. Healy, and J. J. P. Stewart. AM1: A New General Purpose Quantum Mechanical Molecular Model. J. Am. Chem. Soc., 107:3902–3909, 1985.

[74] J. J. P. Stewart. Optimization of Parameters for Semiempirical Methods I. Method. J. Comput. Chem., 10:209–220, 1989. [75] J. J. P. Stewart. Optimization of Parameters for Semiempirical Methods II. Applications. J. Comput. Chem., 10:221–264, 1989.

[76] W. Thiel and A. A. Voityuk. Extension of MNDO to d Orbitals: Parameters and Results for the Second-Row Elements and for the Zinc Group. J. Phys. Chem., 100:616–626, 1996.

[77] M. P. Repasky, J. Chandrasekhar, and W. L. Jorgensen. PDDG/PM3 and PDDG/MNDO: Improved semiempirical methods. J. Comput. Chem., 23(16):1601–1622, 2002.

[78] M. Elstner, D. Porezag, G. Jungnickel, J. Elsner, M. Haugk, T. Frauenheim, S. Suhai, and G. Seifert. Self-consistent-charge density-functional tight-binding method for simulations of complex materials properties. Phys. Rev. B, 58(11):7260–7268, 1998. [79] M. Elstner. The SCC-DFTB method and its application to biological systems. Theor. Chem. Acc., 116(1-3):316–325, 2006.

[80] K. W. Sattelmeyer, J. Tirado-Rives, and W. L. Jorgensen. Comparison of SCC-DFTB and NDDO-based semiempirical molecular orbital methods for organic molecules. J. Phys. Chem. A, 110(50):13551–13559, 2006.

[81] A. R. Leach. Molecular modelling: principles and applications. Prentice Hall, Harlow, England; New York, 2nd edition, 2001. [82] B. H. Mevik and H. R. Cederkvist. Mean squared error of prediction (MSEP) estimates for principal component regression (PCR) and partial least squares regression (PLSR). J. Chemom., 18(9):422–429, 2004. [83] R. Guha and P. C. Jurs. Development of linear, ensemble, and nonlinear models for the prediction and interpretation of the biological activity of a set of PDGFR inhibitors. J. Chem. Inf. Comput. Sci., 44(6):2179–2189, 2004. [84] M. Karelson, V. S. Lobanov, and A. R. Katritzky. Quantum-Chemical Descriptors in QSAR/QSPR Studies. Chem. Rev., 96(3):1027–1044, 1996. [85] M. Brüstle, B. Beck, T. Schindler, W. King, T. Mitchell, and T. Clark. Descriptors, Physical Properties, and Drug-Likeness. J. Med. Chem., 45:3345–3355, 2002. [86] J. Wan, L. Zhang, G. Yang, and C. Zhan. Quantitative Structure-Activity Relationship for Cyclic Imide Derivatives of Protoporphyrinogen Oxidase Inhibitors: A Study of Quantum Chemical Descriptors from Density Functional Theory. J. Chem. Inf. Comput. Sci., 44:2099–2105, 2004. [87] J. J. Sutherland, L. A. O’Brien, and D. F. Weaver. A Comparison of Methods for Modeling Quantitative Structure-Activity Relationships. J. Med. Chem., 47:5541–5554, 2004. [88] S. Dixon, K. M. Merz Jr., G. Lauri, and J. C. Ianni. QMQSAR: Utilization of a Semiempirical Probe Potential in a Field-Based QSAR Method. J. Comput. Chem., 26:23–34, 2005.

[89] A. H. Asikainen, J. Ruuskanen, and K. Tuppurainen. Spectroscopic QSAR methods and self-organizing molecular field analysis for relating molecular structure and estrogenic activity. J. Chem. Inf. Comput. Sci., 43(6):1974–1981, 2003.

[90] A. H. Asikainen, J. Ruuskanen, and K. A. Tuppurainen. Alternative QSAR models for selected estradiol and cytochrome P450 ligands: comparison between classical, spectroscopic, CoMFA and GRID/GOLPE methods. SAR QSAR Environ. Res., 16(6):555–565, 2005. [91] D. B. Turner, P. Willett, A. M. Ferguson, and T. Heritage. Evaluation of a novel infrared range vibration-based descriptor (EVA) for QSAR studies. 1. General application. J. Comput.-Aided Mol. Des., 11(4):409–22, 1997. [92] A. M. Ferguson, T. Heritage, P. Jonathon, S. E. Pack, L. Phillips, J. Rogan, and P. J. Snaith. EVA: A new theoretically based molecular descriptor for use in QSAR/QSPR analysis. J. Comput.-Aided Mol. Des., 11(2):143–152, 1997. [93] C. M. R. Ginn, D. B. Turner, P. Willett, A. M. Ferguson, and T. W. Heritage. Similarity searching in files of three-dimensional chemical structures: Evaluation of the EVA descriptor and combination of rankings using data fusion. J. Chem. Inf. Comput. Sci., 37(1):23–37, 1997. [94] T. W. Heritage, A. M. Ferguson, D. B. Turner, and P. Willett. EVA: A novel theoretical descriptor for QSAR studies. Perspect. Drug Discovery Des., 9-11:381–398, 1998. [95] D. B. Turner, P. Willett, A. M. Ferguson, and T. W. Heritage. Evaluation of a novel molecular vibration-based descriptor (EVA) for QSAR studies: 2. Model validation using a benchmark steroid dataset. J. Comput.-Aided Mol. Des., 13(3):271–296, 1999. [96] D. B. Turner and P. Willett. The EVA spectral descriptor. Eur. J. Med. Chem., 35(4):367–375, 2000. [97] D. B. Turner and P. Willett. Evaluation of the EVA descriptor for QSAR studies: 3. the use of a genetic algorithm to search for models with enhanced predictive properties (EVA GA). J. Comput.-Aided Mol. Des., 14(1):1–21, 2000. [98] M. Ford, L. Phillips, and A. Stevens. Optimising the EVA descriptor for prediction of biological activity. Org. Biomol. Chem., 2(22):3301–3311, 2004. [99] K. Tuppurainen. Frontier orbital energies, hydrophobicity and steric factors as physical QSAR descriptors of molecular mutagenicity. a review with a case study: MX compounds. Chemosphere, 38(13):3015–3030, 1999. [100] K. Tuppurainen. EEVA (electronic eigenvalue): A new QSAR/QSPR descriptor for electronic substituent effects based on molecular orbital energies. SAR QSAR Environ. Res., 10(1):39–46, 1999.

[101] K. Tuppurainen and J. Ruuskanen. Electronic eigenvalue (EEVA): a new QSAR/QSPR descriptor for electronic substituent effects based on molecular orbital energies. a QSAR approach to the Ah receptor binding affinity of polychlorinated biphenyls (PCBs), dibenzo-p-dioxins (PCDDs) and dibenzofurans (PCDFs). Chemosphere, 41(6):843–848, 2000.

[102] K. Tuppurainen, M. Viisas, R. Laatikainen, and M. Perakyla. Evaluation of a novel electronic eigenvalue (EEVA) molecular descriptor for QSAR/QSPR studies: Validation using a benchmark steroid data set. J. Chem. Inf. Comput. Sci., 42(3):607–613, 2002.

[103] R. Bursi, T. Dao, T. van Wijk, M. de Gooyer, E. Kellenbach, and P. Verwer. Comparative spectra analysis (CoSA): spectra as three-dimensional molecular descriptors for the prediction of biological activities. J. Chem. Inf. Comput. Sci., 39(5):861–867, 1999.

[104] E. Besalu, X. Gironés, L. Amat, and R. Carbó-Dorca. Molecular Quantum similarity and the fundamentals of QSAR. Acc. Chem. Res., 35:289–295, 2002.

[105] R. Carbó-Dorca and X. Gironés. Foundation of Quantum Similarity Measures and Their Relationship to QSPR: Density Function Structure, Approximations, and Application Examples. Int. J. Quantum Chem., 101:8–20, 2005.

[106] P. Bultinck, T. Kuppens, X. Gironés, and R. Carbó-Dorca. Quantum similarity superposition algorithm (QSSA): A consistent scheme for molecular alignment and molecular similarity based on quantum chemistry. J. Chem. Inf. Comput. Sci., 43(4):1143–1150, 2003.

[107] S. E. O’Brien and P. L. A. Popelier. Quantum Molecular Similarity. 3. QTMS Descriptors. J. Chem. Inf. Comput. Sci., 41:764–775, 2001.

[108] U. A. Chaudry and P. L. A. Popelier. Estimation of pKa Using Quantum Topological Molecular Similarity Descriptors: Application to Carboxylic Acids, Anilines and Phenols. J. Org. Chem., 69:233–241, 2004.

[109] P. Y. Ren and J. W. Ponder. Consistent treatment of inter- and intramolecular polarization in molecular mechanics calculations. J. Comput. Chem., 23(16):1497–1506, 2002.

[110] V. Gogonea, D. Suarez, A. van der Vaart, and K. M. Merz Jr. New developments in applying quantum mechanics to proteins. Curr. Opin. Struct. Biol., 11(2):217–223, 2001.

[111] S. L. Dixon and K. M. Merz Jr. Semiempirical Molecular Orbital Calculations with Linear System Size Scaling. J. Chem. Phys., 104:6643–6649, 1996.

[112] S. L. Dixon and K. M. Merz Jr. Fast, Accurate Semiempirical Molecular Orbital Calculations for Macromolecules. J. Chem. Phys., 107:879–893, 1997.

[113] A. van der Vaart, V. Gogonea, S. L. Dixon, and K. M. Merz Jr. Linear Scaling Molecular Orbital Calculations of Biological Systems Using the Semiempirical Divide and Conquer Method. J. Comput. Chem., 21:1494–1504, 2000.

[114] W. Kohn. Density-Functional Theory for Systems of Very Many Atoms. Int. J. Quantum Chem., 56(4):229–232, 1995.

[115] X. P. Li, R. W. Nunes, and D. Vanderbilt. Density-matrix electronic-structure method with linear system-size scaling. Phys. Rev. B, 47(16):10891–10894, 1993.

[116] J. J. P. Stewart. Application of localized molecular orbitals to the solution of semiempirical self-consistent field equations. Int. J. Quantum Chem., 58(2):133–146, 1996.

[117] K. Raha and K. M. Merz Jr. A Quantum Mechanics Based Scoring Function: Study of Zinc-ion Mediated ligand binding. J. Am. Chem. Soc., 126:1020–1021, 2004.

[118] K. Raha and K. M. Merz Jr. Large-Scale Validation of a Quantum Mechanics Based Scoring Function: Predicting the Binding Affinity and the Binding Mode of a Diverse Set of Protein-Ligand Complexes. J. Med. Chem., 48:4558–4575, 2005.

[119] H. Fischer and H. Kollmar. Energy Partitioning with the CNDO Method. Theoret. Chim. Acta, 16:163, 1970.

[120] M. J. S. Dewar and D. H. Lo. Application of Energy Partitioning to the MINDO/2 Method and a Study of Cope Rearrangement. J. Am. Chem. Soc., 93:7201–7205, 1971.

[121] S. Olivella and J. Vilarrasa. Application of the Partitioning of Energy in the MNDO Method to the Study of the Basicity of Imidazole, Pyrazole, Oxazole, and Isoxazole. J. Heterocycl. Chem., 18:1189, 1981.

[122] F. Martin and H. Zipse. Charge distribution in the water molecule. a comparison of methods. J. Comput. Chem., 26(1):97–105, 2005.

[123] J. B. Li, T. H. Zhu, C. J. Cramer, and D. G. Truhlar. New class IV charge model for extracting accurate partial charges from wave functions. J. Phys. Chem. A, 102(10):1820–1831, 1998.

[124] J. B. Li, B. Williams, C. J. Cramer, and D. G. Truhlar. A class IV charge model for molecular excited states. J. Chem. Phys., 110(2):724–733, 1999.

[125] J. B. Li, B. Williams, C. J. Cramer, and D. G. Truhlar. A class IV charge model for molecular excited states. J. Chem. Phys., 111(12):5624–5624, 1999.

[126] U. C. Singh and P. A. Kollman. An approach to computing electrostatic charges for molecules. J. Comput. Chem., 5(2):129–145, 1984.

[127] C. I. Bayly, P. Cieplak, W. D. Cornell, and P. A. Kollman. A Well-Behaved Electrostatic Potential Based Method Using Charge Restraints for Deriving Atomic Charges - the RESP Model. J. Phys. Chem., 97(40):10269–10280, 1993.

[128] W. D. Cornell, P. Cieplak, C. I. Bayly, and P. A. Kollman. Application of RESP Charges to Calculate Conformational Energies, Hydrogen-Bond Energies, and Free-Energies of Solvation. J. Am. Chem. Soc., 115(21):9620–9631, 1993.

[129] J. M. Wang, P. Cieplak, and P. A. Kollman. How well does a restrained electrostatic potential (RESP) model perform in calculating conformational energies of organic and biological molecules? J. Comput. Chem., 21(12):1049–1074, 2000.

[130] R. C. Wade, A. R. Ortiz, and F. Gago. Comparative binding energy analysis. Perspect. Drug Discovery Des., 9-11:19–34, 1998.

[131] A. R. Ortiz, M. Pastor, A. Palomer, G. Cruciani, F. Gago, and R. C. Wade. Reliability of comparative molecular field analysis models: Effects of data scaling and variable selection using a set of human synovial fluid phospholipase A(2) inhibitors. J. Med. Chem., 40(7):1136–1148, 1997.

[132] C. Perez, M. Pastor, A. R. Ortiz, and F. Gago. Comparative binding energy analysis of HIV-1 protease inhibitors: Incorporation of solvent effects and validation as a powerful tool in receptor-based drug design. J. Med. Chem., 41(6):836–852, 1998.

[133] T. Wang and R. C. Wade. Comparative binding energy (COMBINE) analysis of influenza neuraminidase-inhibitor complexes. J. Med. Chem., 44(6):961–971, 2001.

[134] J. Kmunicek, S. Luengo, F. Gago, A. R. Ortiz, R. C. Wade, and J. Damborsky. Comparative binding energy analysis of the substrate specificity of haloalkane dehalogenase from Xanthobacter autotrophicus GJ10. Biochemistry, 40(30):8905–8917, 2001.

[135] T. Wang, S. Tomic, R. R. Gabdoulline, and R. C. Wade. How optimal are the binding energetics of barnase and barstar? Biophys. J., 87(3):1618–1630, 2004.

[136] S. Tomic, L. Nilsson, and R. C. Wade. Nuclear receptor-DNA binding specificity: A COMBINE and Free-Wilson QSAR analysis. J. Med. Chem., 43(9):1780–1792, 2000.

[137] T. Wang and R. C. Wade. Comparative binding energy (COMBINE) analysis of OppA-peptide complexes to relate structure to binding thermodynamics. J. Med. Chem., 45(22):4828–4837, 2002.

[138] M. Murcia and A. R. Ortiz. Virtual screening with flexible docking and COMBINE-based models. application to a series of factor Xa inhibitors. J. Med. Chem., 47(4):805–820, 2004.

[139] K. Hasegawa, T. Kimura, and K. Funatsu. GA strategy for variable selection in QSAR studies: Enhancement of comparative molecular binding energy analysis by GA-based PLS method. Quant. Struct.-Act. Relat., 18(3):262–272, 1999.

[140] M. B. Peters. A semiempirical comparative binding energy analysis study of a series of trypsin inhibitors. Master’s thesis, The Pennsylvania State University, 2005.

[141] R. Diestel. Graph theory. Springer, Berlin, 2005.

[142] P. Labute. On the perception of molecules from 3D atomic coordinates. J. Chem. Inf. Model., 45(2):215–221, 2005.

[143] T. R. Cundari, C. Sarbu, and H. F. Pop. Robust fuzzy principal component analysis (FPCA). a comparative study concerning interaction of carbon-hydrogen bonds with molybdenum-oxo bonds. J. Chem. Inf. Comput. Sci., 42(6):1363–1369, 2002.

[144] S. Wold, J. Trygg, A. Berglund, and H. Antti. Some recent developments in PLS modeling. Chemom. Intell. Lab. Syst., 58(2):131–150, 2001.

[145] G. M. Ullmann, E. W. Knapp, and N. M. Kostic. Computational simulation and analysis of dynamic association between plastocyanin and cytochrome f. consequences for the electron-transfer reaction. J. Am. Chem. Soc., 119(1):42–52, 1997.

[146] J. O. A. De Kerpel and U. Ryde. Protein strain in blue copper proteins studied by free energy perturbations. Proteins: Struct. Funct. Genet., 36(2):157–174, 1999.

[147] M. H. M. Olsson and U. Ryde. The influence of axial ligands on the reduction potential of blue copper proteins. J. Biol. Inorg. Chem., 4(5):654–663, 1999.

[148] R. Remenyi and P. Comba. A new general molecular mechanics force field for the oxidized form of blue copper proteins. J. Inorg. Biochem., 86(1):397–397, 2001.

[149] P. Comba, A. Lledos, F. Maseras, and R. Remenyi. Hybrid quantum mechanics/molecular mechanics studies of the active site of the blue copper proteins amicyanin and rusticyanin. Inorg. Chim. Acta, 324(1-2):21–26, 2001.

[150] P. Comba and R. Remenyi. A new molecular mechanics force field for the oxidized form of blue copper proteins. J. Comput. Chem., 23(7):697–705, 2002.

[151] D. Suarez, N. Diaz, and K. M. Merz Jr. Ureases: Quantum chemical calculations on cluster models. J. Am. Chem. Soc., 125(50):15324–15337, 2003.

[152] G. Estiu and K. M. Merz Jr. Enzymatic catalysis of urea decomposition: Elimination or hydrolysis? J. Am. Chem. Soc., 126(38):11832–11842, 2004.

[153] G. Estiu and K. M. Merz Jr. Catalyzed decomposition of urea. Molecular dynamics simulations of the binding of urea to urease. Biochemistry, 45(14):4429–4443, 2006.

[154] G. Estiu, D. Suarez, and K. M. Merz Jr. Quantum mechanical and molecular dynamics simulations of ureases and Zn beta-lactamases. J. Comput. Chem., 27(12):1240–1262, 2006.

[155] The Apache Project. Xerces-C++ Parser. http://xml.apache.org/xerces-c/ (accessed Oct 1, 2005).

[156] A. M. Wollacott. Computational studies of the applicability of semiempirical quantum mechanical methods to study protein structure. PhD thesis, The Pennsylvania State University, 2005.

[157] A. M. Wollacott and K. M. Merz Jr. Haptic applications for molecular structure manipulation. J. Mol. Graphics Modell., 25(6):801–805, 2007.

[158] R. J. F. Branco, P. A. Fernandes, and M. J. Ramos. Molecular dynamics simulations of the enzyme Cu, Zn superoxide dismutase. J. Phys. Chem. B, 110(33):16754–16762, 2006.

[159] T. Wang and J. J. Zhou. 3DFS: A new 3D flexible searching system for use in drug design. J. Chem. Inf. Comput. Sci., 38(1):71–77, 1998.

[160] J. M. Wang, R. M. Wolf, J. W. Caldwell, P. A. Kollman, and D. A. Case. Development and testing of a general amber force field. J. Comput. Chem., 25(9):1157–1174, 2004.

[161] E. C. Meng and R. A. Lewis. Determination of Molecular Topology and Atomic Hybridization States from Heavy-Atom Coordinates. J. Comput. Chem., 12(7):891–898, 1991.

[162] F. H. Allen, O. Kennard, D. G. Watson, L. Brammer, A. G. Orpen, and R. Taylor. Tables of Bond Lengths Determined by X-Ray and Neutron-Diffraction. 1. Bond Lengths in Organic-Compounds. J. Chem. Soc., Perkin Trans. 2, (12):S1–S19, 1987.

[163] J. C. Baber and E. E. Hodgkin. Automatic Assignment of Chemical Connectivity to Organic-Molecules in the Cambridge Structural Database. J. Chem. Inf. Comput. Sci., 32(5):401–406, 1992.

[164] A. G. Orpen, L. Brammer, F. H. Allen, O. Kennard, D. G. Watson, and R. Taylor. Tables of Bond Lengths Determined by X-Ray and Neutron-Diffraction. 2. Organometallic Compounds and Co-Ordination Complexes of the D-Block and F-Block Metals. J. Chem. Soc., Dalton Trans., (12):S1–S83, 1989.

[165] M. Hendlich, F. Rippmann, and G. Barnickel. BALI: Automatic assignment of bond and atom types for protein ligands in the Brookhaven Protein Databank. J. Chem. Inf. Comput. Sci., 37(4):774–778, 1997.

[166] J. M. Wang, W. Wang, P. A. Kollman, and D. A. Case. Automatic atom type and bond type perception in molecular mechanical calculations. J. Mol. Graphics Modell., 25(2):247–260, 2006.

[167] B. T. Fan, A. Panaye, J. P. Doucet, and A. Barbu. Ring Perception - A New Algorithm for Directly Finding the Smallest Set of Smallest Rings from a Connection Table. J. Chem. Inf. Comput. Sci., 33(5):657–662, 1993.

[168] B. L. Roos-Kozel and W. L. Jorgensen. Computer-Assisted Mechanistic Evaluation of Organic-Reactions. 2. Perception of Rings, Aromaticity, and Tautomers. J. Chem. Inf. Comput. Sci., 21(2):101–111, 1981.

[169] M. Lipton and W. C. Still. The multiple minimum problem in molecular modeling - tree searching internal coordinate conformational space. J. Comput. Chem., 9(4):343–355, 1988.

[170] J. R. Ullmann. An algorithm for subgraph isomorphism. J. ACM, 23(1):31–42, 1976.

[171] P. Willett, T. Wilson, and S. F. Reddaway. Atom-by-atom searching using massive parallelism. implementation of the Ullmann subgraph isomorphism algorithm on the distributed array processor. J. Chem. Inf. Comput. Sci., 31(2):225–233, 1991.

[172] E. J. Barker, D. Buttar, D. A. Cosgrove, E. J. Gardiner, P. Kitts, P. Willett, and V. J. Gillet. Scaffold hopping using clique detection applied to reduced graphs. J. Chem. Inf. Model., 46(2):503–511, 2006.

[173] S. K. Kearsley. On the Orthogonal Transformation Used for Structural Comparisons. Acta Crystallogr., Sect. A: Found. Crystallogr., 45:208–210, 1989.

[174] W. Kabsch. Solution for Best Rotation to Relate 2 Sets of Vectors. Acta Crystallogr., Sect. A: Found. Crystallogr., 32:922–923, 1976.

[175] W. Kabsch. Discussion of Solution for Best Rotation to Relate 2 Sets of Vectors. Acta Crystallogr., Sect. A: Found. Crystallogr., 34:827–828, 1978.

[176] G. Carta, V. Onnis, A. J. S. Knox, D. Fayne, and D. G. Lloyd. Permuting input for more effective sampling of 3D conformer space. J. Comput.-Aided Mol. Des., 20(3):179–190, 2006.

[177] C. Lemmen, M. Zimmermann, and T. Lengauer. Multiple molecular superpositioning as an effective tool for virtual database screening. Perspect. Drug Discovery Des., 20(1):43–62, 2000.

[178] F. Daeyaert, M. de Jonge, J. Heeres, L. Koymans, P. Lewi, M. H. Vinkers, and P. A. J. Janssen. A pharmacophore docking algorithm and its application to the cross-docking of 18 HIV-NNRTI’s in their binding pockets. Proteins: Struct. Funct. Genet., 54(3):526–533, 2004.

[179] C. Lemmen and T. Lengauer. Computational methods for the structural alignment of molecules. J. Comput.-Aided Mol. Des., 14(3):215–232, 2000.

[180] Q. Chen, R. E. Higgs, and M. Vieth. Geometric accuracy of three-dimensional molecular overlays. J. Chem. Inf. Model., 46(5):1996–2002, 2006.

[181] S. K. Drayton, K. Edwards, N. Jewell, D. B. Turner, D. J. Wild, P. Willett, P. M. Wright, and K. Simmons. Similarity searching in files of three-dimensional chemical structures: Identification of bioactive molecules. Internet J. Chem., 1(37):CP3–U34, 1998.

[182] F. Melani, P. Gratteri, M. Adamo, and C. Bonaccini. Field interaction and geometrical overlap: A new simplex and experimental design based computational procedure for superposing small ligand molecules. J. Med. Chem., 46(8):1359–1371, 2003.

[183] R. P. Sheridan, R. Nilakantan, J. S. Dixon, and R. Venkataraghavan. The ensemble approach to distance geometry - application to the nicotinic pharmacophore. J. Med. Chem., 29(6):899–906, 1986.

[184] S. K. Kearsley and G. M. Smith. An alternative method for the alignment of molecular structures: Maximizing electrostatic and steric overlap. Tetrahedron Comput. Methodol., 3:615–633, 1990.

[185] A. C. Good, E. E. Hodgkin, and W. G. Richards. Utilization of gaussian functions for the rapid evaluation of molecular similarity. J. Chem. Inf. Comput. Sci., 32(3):188–191, 1992.

[186] Y. C. Martin, M. G. Bures, E. A. Danaher, J. Delazzer, I. Lico, and P. A. Pavlik. A fast new approach to pharmacophore mapping and its application to dopaminergic and benzodiazepine agonists. J. Comput.-Aided Mol. Des., 7(1):83–102, 1993.

[187] B. B. Masek, A. Merchant, and J. B. Matthew. Molecular Shape Comparison of Angiotensin-II Receptor Antagonists. J. Med. Chem., 36(9):1230–1238, 1993.

[188] G. Klebe, T. Mietzner, and F. Weber. Different approaches toward an automatic structural alignment of drug molecules - applications to sterol mimics, thrombin and thermolysin inhibitors. J. Comput.-Aided Mol. Des., 8(6):751–778, 1994.

[189] A. N. Jain, T. G. Dietterich, R. H. Lathrop, D. Chapman, R. E. Critchlow, B. E. Bauer, T. A. Webster, and T. Lozano-Perez. Compass - a shape-based machine learning tool for drug design. J. Comput.-Aided Mol. Des., 8(6):635–652, 1994.

[190] G. Jones, P. Willett, and R. C. Glen. A genetic algorithm for flexible molecular overlay and pharmacophore elucidation. J. Comput.-Aided Mol. Des., 9(6):532–549, 1995.

[191] R. A. Dammkoehler, S. F. Karasek, E. F. B. Shands, and G. R. Marshall. Sampling conformational hyperspace: Techniques for improving completeness. J. Comput.-Aided Mol. Des., 9(6):491–499, 1995.

[192] C. McMartin and R. S. Bohacek. Flexible Matching of Test Ligands to a 3D Pharmacophore Using a Molecular Superposition Force-Field - Comparison of Predicted and Experimental Conformations of Inhibitors of 3 Enzymes. J. Comput.-Aided Mol. Des., 9(3):237–250, 1995.

[193] T. D. J. Perkins, J. E. J. Mills, and P. M. Dean. Molecular surface-volume and property matching to superpose flexible dissimilar molecules. J. Comput.-Aided Mol. Des., 9(6):479–490, 1995.

[194] M. Petitjean. Geometric molecular similarity from volume-based distance minimization - application to saxitoxin and tetrodotoxin. J. Comput. Chem., 16(1):80–90, 1995.

[195] J. A. Grant, M. A. Gallardo, and B. T. Pickup. A fast method of molecular shape comparison: A simple application of a gaussian description of molecular shape. J. Comput. Chem., 17(14):1653–1666, 1996.

[196] C. Lemmen and T. Lengauer. Time-efficient flexible superposition of medium-sized molecules. J. Comput.-Aided Mol. Des., 11(4):357–368, 1997.

[197] A. J. McMahon and P. M. King. Optimization of Carbo molecular similarity index using gradient methods. J. Comput. Chem., 18(2):151–158, 1997.

[198] A. Cossé-Barbi and M. Raji. Discrete pattern recognition by fitting onto a continuous function. J. Comput. Chem., 18(15):1875–1892, 1997.

[199] J. Mestres, D. C. Rohrer, and G. M. Maggiora. Mimic: A molecular-field matching program. exploiting applicability of molecular similarity approaches. J. Comput. Chem., 18(7):934–954, 1997.

[200] J. W. M. Nissink, M. L. Verdonk, J. Kroon, T. Mietzner, and G. Klebe. Superposition of molecules: Electron density fitting by application of Fourier transforms. J. Comput. Chem., 18(5):638–645, 1997.

[201] M. F. Parretti, R. T. Kroemer, J. H. Rothman, and W. G. Richards. Alignment of molecules by the Monte Carlo optimization of molecular similarity indices. J. Comput. Chem., 18(11):1344–1353, 1997.

[202] M. Cocchi and P. G. De Benedetti. Use of the supermolecule approach to derive molecular similarity descriptors for QSAR analysis. J. Mol. Model., 4(3):113–131, 1998.

[203] M. C. De Rosa and A. Berglund. A new method for predicting the alignment of flexible molecules and orienting them in a receptor cleft of known structure. J. Med. Chem., 41(5):691–698, 1998.

[204] S. Handschuh, M. Wagener, and J. Gasteiger. Superposition of three-dimensional chemical structures allowing for conformational flexibility by a hybrid method. J. Chem. Inf. Comput. Sci., 38(2):220–232, 1998.

[205] T. Wang and J. J. Zhou. 3DFS: 3D flexible searching system for lead discovery - new version 1.2. J. Mol. Model., 5(11):231–251, 1999.

[206] C. Lemmen, C. Hiller, and T. Lengauer. RigFit: A new approach to superimposing ligand molecules. J. Comput.-Aided Mol. Des., 12(5):491–502, 1998.

[207] M. D. Miller, R. P. Sheridan, and S. K. Kearsley. SQ: a program for rapidly producing pharmacophorically relevent molecular superpositions. J. Med. Chem., 42(9):1505–14, 1999.

[208] G. Klebe, T. Mietzner, and F. Weber. Methodological developments and strategies for a fast flexible superposition of drug-size molecules. J. Comput.-Aided Mol. Des., 13(1):35–49, 1999.

[209] M. de Caceres, J. Villa, J. J. Lozano, and F. Sanz. MIPSIM: similarity analysis of molecular interaction potentials. Bioinformatics, 16(6):568–569, 2000.

[210] D. A. Cosgrove, D. M. Bayada, and A. P. Johnson. A novel method of aligning molecules by local surface shape similarity. J. Comput.-Aided Mol. Des., 14(6):573–591, 2000.

[211] M. Feher and J. M. Schmidt. Multiple flexible alignment with SEAL: A study of molecules acting on the colchicine binding site. J. Chem. Inf. Comput. Sci., 40(2):495–502, 2000.

[212] X. Gironés, D. Robert, and R. Carbó-Dorca. TGSA: A molecular superposition program based on topo-geometrical considerations. J. Comput. Chem., 22(2):255–263, 2001.

[213] X. Gironés and R. Carbó-Dorca. TGSA-flex: Extending the capabilities of the topo-geometrical superposition algorithm to handle flexible molecules. J. Comput. Chem., 25(2):153–159, 2004.

[214] P. Labute, C. Williams, M. Feher, E. Sourial, and J. M. Schmidt. Flexible alignment of small molecules. J. Med. Chem., 44(10):1483–1490, 2001.

[215] J. E. J. Mills, I. J. P. de Esch, T. D. J. Perkins, and P. M. Dean. Slate: A method for the superposition of flexible ligands. J. Comput.-Aided Mol. Des., 15(1):81–96, 2001.

[216] M. C. Pitman, W. K. Huber, H. Horn, A. Kramer, J. E. Rice, and W. C. Swope. FLASHFLOOD: A 3D field-based similarity search and alignment method for flexible molecules. J. Comput.-Aided Mol. Des., 15(7):587–612, 2001.

[217] A. Krämer, H. W. Horn, and J. E. Rice. Fast 3D molecular superposition and similarity search in databases of flexible molecules. J. Comput.-Aided Mol. Des., 17(1):13–38, 2003.

[218] S. P. Korhonen, K. Tuppurainen, R. Laatikainen, and M. Perakyla. Comparing the performance of FLUFF-BALL to SEAL-CoMFA with a large diverse estrogen data set: From relevant superpositions to solid predictions. J. Chem. Inf. Model., 45(6):1874–1883, 2005.

[219] A. J. Tervo, T. Ronkko, T. H. Nyronen, and A. Poso. BRUTUS: Optimization of a grid-based similarity function for rigid-body molecular superposition. I. alignment and virtual screening applications. J. Med. Chem., 48(12):4076–4086, 2005.

[220] T. Ronkko, A. J. Tervo, J. Parkkinen, and A. Poso. BRUTUS: Optimization of a grid-based similarity function for rigid-body molecular superposition. II. description and characterization. J. Comput.-Aided Mol. Des., 20(4):227–236, 2006.

[221] S. J. Cho and Y. X. Sun. FLAME: A program to flexibly align molecules. J. Chem. Inf. Model., 46(1):298–306, 2006.

[222] J. Marialke, R. Korner, S. Tietze, and J. Apostolakis. Graph-based molecular alignment (GMA). J. Chem. Inf. Model., 47(2):591–601, 2007.

[223] G. Klebe and T. Mietzner. A fast and efficient method to generate biologically relevant conformations. J. Comput.-Aided Mol. Des., 8(5):583–606, 1994.

[224] J. Sadowski and J. Boström. Mimumba revisited: Torsion angle rules for conformer generation derived from x-ray structures. J. Chem. Inf. Model., 46(6):2305–2309, 2006.

[225] F. Daeyaert, M. de Jonge, J. Heeres, L. Koymans, P. Lewi, W. van den Broeck, and M. Vinkers. Pareto optimal flexible alignment of molecules using a non-dominated sorting genetic algorithm. Chemom. Intell. Lab. Syst., 77(1-2):232–237, 2005.

[226] A. Strizhev, E. J. Abrahamian, S. Choi, J. M. Leonard, P. R. N. Wolohan, and R. D. Clark. The Effects of Biasing Torsional Mutations in a Conformational GA. J. Chem. Inf. Model., 46(4):1862–1870, 2006.

[227] D. K. Agrafiotis, A. C. Gibbs, F. Zhu, S. Izrailev, and E. Martin. Conformational sampling of bioactive molecules: A comparative study. J. Chem. Inf. Model., 47(3):1067–1086, 2007.

[228] J. Boström, P. O. Norrby, and T. Liljefors. Conformational energy penalties of protein-bound ligands. J. Comput.-Aided Mol. Des., 12(4):383–396, 1998.

[229] J. Boström. Reproducing the conformations of protein-bound ligands: A critical evaluation of several popular conformational searching tools. J. Comput.-Aided Mol. Des., 15(12):1137–1152, 2001.

[230] D. J. Diller and K. M. Merz Jr. Can we separate active from inactive conformations? J. Comput.-Aided Mol. Des., 16(2):105–112, 2002.

[231] J. Boström, J. R. Greenwood, and J. Gottfries. Assessing the performance of omega with respect to retrieving bioactive conformations. J. Mol. Graphics Modell., 21(5):449–462, 2003.

[232] S. Putta, G. A. Landrum, and J. E. Penzotti. Conformation mining: An algorithm for finding biologically relevant. J. Med. Chem., 48(9):3313–3318, 2005.

[233] S. L. Dixon and K. M. Merz Jr. QMALIGN.

[234] C. Lemmen, T. Lengauer, and G. Klebe. FLEXS: A method for fast flexible ligand superposition. J. Med. Chem., 41(23):4502–4520, 1998.

[235] R Development Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria, 2005.

[236] T. M. Willson, P. J. Brown, D. D. Sternbach, and B. R. Henke. The PPARs: From orphan receptors to drug discovery. J. Med. Chem., 43(4):527–550, 2000.

[237] J. C. Parker. Troglitazone: the discovery and development of a novel therapy for the treatment of type 2 diabetes mellitus. Adv. Drug Deliv. Rev., 54(9):1173–97, 2002.

[238] P. J. Rybczynski, R. E. Zeck, J. Dudash, D. W. Combs, T. P. Burris, M. Yang, M. C. Osborne, X. L. Chen, and K. T. Demarest. Benzoxazinones as PPAR gamma agonists. 2. SAR of the amide substituent and in vivo results in a type 2 diabetes model. J. Med. Chem., 47(1):196–209, 2004.

[239] C. Z. Liao, A. H. Xie, L. M. Shi, J. J. Zhou, and X. P. Lu. Eigenvalue analysis of peroxisome proliferator-activated receptor gamma agonists. J. Chem. Inf. Comput. Sci., 44(1):230–238, 2004.

[240] C. Z. Liao, A. H. Xie, J. J. Zhou, L. M. Shi, Z. B. Li, and X. P. Lu. 3D QSAR studies on peroxisome proliferator-activated receptor gamma agonists using CoMFA and CoMSIA. J. Mol. Model., 10(3):165–177, 2004.

[241] T. Tuccinardi, E. Nuti, G. Ortore, C. T. Supuran, A. Rossello, and A. Martinelli. Analysis of human carbonic anhydrase II: Docking reliability and receptor-based 3D-QSAR study. J. Chem. Inf. Model., 47(2):515–525, 2007.

[242] C.-Y. Kim, D. A. Whittington, J. S. Chang, J. Liao, J. A. May, and D. W. Christianson. Structural Aspects of Isozyme Selectivity in the Binding of Inhibitors to Carbonic Anhydrases II and IV. J. Med. Chem., 45(4):888–893, 2002.

[243] B. A. Grzybowski, A. V. Ishchenko, C. Y. Kim, G. Topalov, R. Chapman, D. W. Christianson, G. M. Whitesides, and E. I. Shakhnovich. Combinatorial computational method gives new picomolar ligands for a known enzyme. Proc. Natl. Acad. Sci. U. S. A., 99(3):1270–3, 2002.

[244] S. Grüneberg, M. T. Stubbs, and G. Klebe. Successful Virtual Screening for Novel Inhibitors of Human Carbonic Anhydrase: Strategy and Experimental Confirmation. J. Med. Chem., 45(17):3588–3602, 2002.

[245] G. M. Smith, R. S. Alexander, D. W. Christianson, B. M. McKeever, G. S. Ponticello, J. P. Springer, W. C. Randall, J. J. Baldwin, and C. N. Habecker. Positions of His-64 and a bound water in human carbonic anhydrase II upon binding three structurally related inhibitors. Protein Sci., 3(1):118–25, 1994.

[246] A. Weber, A. Casini, A. Heine, D. Kuhn, C. T. Supuran, A. Scozzafava, and G. Klebe. Unexpected Nanomolar Inhibition of Carbonic Anhydrase by COX-2-Selective Celecoxib: New Pharmacological Opportunities Due to Related Binding Site Recognition. J. Med. Chem., 47(3):550–557, 2004.

[247] R. Recacha, M. J. Costanzo, B. E. Maryanoff, and D. Chattopadhyay. Crystal structure of human carbonic anhydrase II complexed with an anti-convulsant sugar sulphamate. Biochem. J., 361(3):437–41, 2002.

[248] M. D. Lloyd, N. Thiyagarajan, Y. T. Ho, L. W. L. Woo, O. B. Sutcliffe, A. Purohit, M. J. Reed, K. R. Acharya, and B. V. L. Potter. First Crystal Structures of Human Carbonic Anhydrase II in Complex with Dual Aromatase-Steroid Sulfatase Inhibitors. Biochemistry, 44(18):6858–6866, 2005.

[249] C.-Y. Kim, P. P. Chandra, A. Jain, and D. W. Christianson. Fluoroaromatic-Fluoroaromatic Interactions between Inhibitors Bound in the Crystal Lattice of Human Carbonic Anhydrase II. J. Am. Chem. Soc., 123(39):9620–9627, 2001.

[250] V. Menchise, G. DeSimone, V. Alterio, A. DiFiore, C. Pedone, A. Scozzafava, and C. T. Supuran. Carbonic Anhydrase Inhibitors: Stacking with Phe131 Determines Active Site Binding Region of Inhibitors As Exemplified by the X-ray Crystal Structure of a Membrane-Impermeant Antitumor Sulfonamide Complexed with Isozyme II. J. Med. Chem., 48(18):5721–5727, 2005.

[251] R. D. Hancock. Molecular Mechanics Calculations as a Tool in Coordination Chemistry. Prog. Inorg. Chem., 37:187–291, 1989.

[252] S. C. Hoops, K. W. Anderson, and K. M. Merz Jr. Force-Field Design for Metalloproteins. J. Am. Chem. Soc., 113(22):8262–8270, 1991.

[253] C. I. Bayly, P. Cieplak, W. D. Cornell, and P. A. Kollman. A well-behaved electrostatic potential based method using charge restraints for deriving atomic charges: the RESP model. J. Phys. Chem., 97(40):10269–10280, 1993.

[254] R. H. Stote and M. Karplus. Zinc binding in proteins and solution: a simple but accurate nonbonded representation. Proteins, 23(1):12–31, 1995.

[255] D. V. Sakharov and C. Lim. Zn protein simulations including charge transfer and local polarization effects. J. Am. Chem. Soc., 127(13):4921–4929, 2005.

[256] J. Aqvist and A. Warshel. Computer simulation of the initial proton transfer step in human carbonic anhydrase I. J. Mol. Biol., 224(1):7–14, 1992.

[257] Y. P. Pang, K. Xu, J. E. Yazal, and F. G. Prendergast. Successful molecular dynamics simulation of the zinc-bound farnesyltransferase using the cationic dummy atom approach. Protein Sci., 9(10):1857–65, 2000.

[258] Y. P. Pang. Successful molecular dynamics simulation of two zinc complexes bridged by a hydroxide in phosphotriesterase using the cationic dummy atom method. Proteins, 45(3):183–9, 2001.

[259] A. Vedani and D. W. Huhta. A New Force-Field for Modeling Metalloproteins. J. Am. Chem. Soc., 112(12):4759–4767, 1990.

[260] N. Gresh, J. P. Piquemal, and M. Krauss. Representation of Zn(II) complexes in polarizable molecular mechanics. Further refinements of the electrostatic and short-range contributions. Comparisons with parallel ab initio computations. J. Comput. Chem., 26(11):1113–30, 2005.

[261] N. Gresh. Development, validation, and applications of anisotropic polarizable molecular mechanics to study ligand and drug-receptor interactions. Curr. Pharm. Des., 12(17):2121–58, 2006.

[262] A. K. Rappe, C. J. Casewit, K. S. Colwell, W. A. Goddard, and W. M. Skiff. UFF, a Full Periodic-Table Force-Field for Molecular Mechanics and Molecular-Dynamics Simulations. J. Am. Chem. Soc., 114(25):10024–10035, 1992.

[263] A. K. Rappe, K. S. Colwell, and C. J. Casewit. Application of a Universal Force-Field to Metal-Complexes. Inorg. Chem., 32(16):3438–3450, 1993.

[264] J. M. Sirovatka, A. K. Rappe, and R. G. Finke. Molecular mechanics studies of coenzyme B-12 complexes with constrained Co-N(axial-base) bond lengths: introduction of the universal force field (UFF) to coenzyme B-12 chemistry and its use to probe the plausibility of an axial-base-induced, ground-state corrin butterfly conformational steric effect. Inorg. Chim. Acta, 300:545–555, 2000.

[265] P. Brandt, T. Norrby, E. Akermark, and P. O. Norrby. Molecular mechanics (MM3*) parameters for ruthenium(II)-polypyridyl complexes. Inorg. Chem., 37(16):4120–4127, 1998.

[266] H. M. Marques and K. L. Brown. A Molecular Mechanics Force-Field for the Cobalt Corrinoids. J. Mol. Struct. (Theochem), 340:97–124, 1995.

[267] K. L. Brown, X. Zou, and H. M. Marques. NMR-restrained molecular modeling of cobalt corrinoids: cyanocobalamin (vitamin B-12) and methylcobalt corrinoids. J. Mol. Struct. (Theochem), 453:209–224, 1998.

[268] H. M. Marques and K. L. Brown. The structure of cobalt corrinoids based on molecular mechanics and NOE-restrained molecular mechanics and dynamics simulations. Coord. Chem. Rev., 192:127–153, 1999.

[269] H. M. Marques, B. Ngoma, T. J. Egan, and K. L. Brown. Parameters for the AMBER force field for the molecular mechanics modeling of the cobalt corrinoids. J. Mol. Struct., 561(1-3):71–91, 2001.

[270] J. Aqvist and A. Warshel. Free-Energy Relationships in Metalloenzyme-Catalyzed Reactions - Calculations of the Effects of Metal-Ion Substitutions in Staphylococcal Nuclease. J. Am. Chem. Soc., 112(8):2860–2868, 1990.

[271] U. Ryde. Molecular-Dynamics Simulations of Alcohol-Dehydrogenase with a 4-Coordinate or 5-Coordinate Catalytic Zinc Ion. Proteins: Struct. Funct. Genet., 21(1):40–56, 1995.

[272] U. Ryde. On the Role of Glu-68 in Alcohol-Dehydrogenase. Protein Sci., 4(6):1124–1132, 1995.

[273] U. Ryde. Carboxylate binding modes in zinc proteins: A theoretical study. Biophys. J., 77(5):2777–2787, 1999.

[274] R. D. Hancock, J. S. Weaving, and H. M. Marques. A Molecular Mechanics Model of the Metalloporphyrins - the Role of Steric Hindrance in Discrimination in Favor of Dioxygen Relative to Carbon-Monoxide in Some Heme Models. J. Chem. Soc., Chem. Commun., (16):1176–1178, 1989.

[275] H. M. Marques and I. Cukrowski. Molecular mechanics modelling of porphyrins. Using artificial neural networks to develop metal parameters for four-coordinate metalloporphyrins. Phys. Chem. Chem. Phys., 4(23):5878–5887, 2002.

[276] H. M. Marques and K. L. Brown. Molecular mechanics and molecular dynamics simulations of porphyrins, metalloporphyrins, heme proteins and cobalt corrinoids. Coord. Chem. Rev., 225(1-2):123–158, 2002.

[277] C. E. Skopec, J. M. Robinson, I. Cukrowski, and H. M. Marques. Using artificial neural networks to develop molecular mechanics parameters for the modelling of metalloporphyrins. III. Five coordinate Zn(II) porphyrins and the metalloporphyrins of the early 3d metals. J. Mol. Struct., 738(1-3):67–78, 2005.

[278] C. E. Skopec, I. Cukrowski, and H. M. Marques. Using artificial neural networks to develop molecular mechanics parameters for the modelling of metalloporphyrins: Part IV. Five-, six-coordinate metalloporphyrins of Mn, Co, Ni and Cu. J. Mol. Struct., 783(1-3):21–33, 2006.

[279] P. O. Norrby and T. Liljefors. Automated molecular mechanics parameterization with simultaneous utilization of experimental and quantum mechanical data. J. Comput. Chem., 19(10):1146–1166, 1998.

[280] P. O. Norrby and P. Brandt. Deriving force field parameters for coordination complexes. Coord. Chem. Rev., 212:79–109, 2001.

[281] K. M. Merz Jr. CO2 Binding to Human Carbonic Anhydrase-II. J. Am. Chem. Soc., 113(2):406–411, 1991.

[282] K. M. Merz Jr., M. A. Murcko, and P. A. Kollman. Inhibition of Carbonic-Anhydrase. J. Am. Chem. Soc., 113(12):4484–4490, 1991.

[283] N. Diaz, D. Suarez, and K. M. Merz Jr. Hydration of zinc ions: theoretical study of [Zn(H2O)(4)](H2O)(8)(2+) and [Zn(H2O)(6)](H2O)(6)(2+). Chem. Phys. Lett., 326(3-4):288–292, 2000.

[284] N. Diaz, D. Suarez, and K. M. Merz Jr. Zinc metallo-beta-lactamase from Bacteroides fragilis: A quantum chemical study on model systems of the active site. J. Am. Chem. Soc., 122(17):4197–4208, 2000.

[285] N. Diaz, D. Suarez, and K. M. Merz Jr. Molecular dynamics simulations of the mononuclear zinc-beta-lactamase from Bacillus cereus complexed with benzylpenicillin and a quantum chemical study of the reaction mechanism. J. Am. Chem. Soc., 123(40):9867–9879, 2001.

[286] N. Diaz, D. Suarez, T. L. Sordo, and K. M. Merz Jr. A theoretical study of the aminolysis reaction of lysine 199 of human serum albumin with benzylpenicillin: Consequences for immunochemistry of penicillins. J. Am. Chem. Soc., 123(31):7574–7583, 2001.

[287] N. Diaz, D. Suarez, T. L. Sordo, and K. M. Merz Jr. Acylation of class A beta-lactamases by penicillins: A theoretical examination of the role of serine 130 and the beta-lactam carboxylate group. J. Phys. Chem. B, 105(45):11302–11313, 2001.

[288] D. Suarez and K. M. Merz Jr. Molecular dynamics simulations of the mononuclear zinc-beta-lactamase from Bacillus cereus. J. Am. Chem. Soc., 123(16):3759–3770, 2001.

[289] N. Diaz, T. L. Sordo, K. M. Merz Jr., and D. Suarez. Insights into the acylation mechanism of class A beta-lactamases from molecular dynamics simulations of the TEM-1 enzyme complexed with benzylpenicillin. J. Am. Chem. Soc., 125(3):672–684, 2003.

[290] N. Diaz, D. Suarez, K. M. Merz Jr., and T. L. Sordo. Molecular dynamics simulations of the TEM-1 beta-lactamase complexed with cephalothin. J. Med. Chem., 48(3):780–791, 2005.

[291] D. Suarez, E. N. Brothers, and K. M. Merz Jr. Insights into the structure and dynamics of the dinuclear zinc beta-lactamase site from Bacteroides fragilis. Biochemistry, 41(21):6615–6630, 2002.

[292] D. Suarez, N. Diaz, and K. M. Merz Jr. Molecular dynamics simulations of the dinuclear zinc-beta-lactamase from Bacteroides fragilis complexed with imipenem. J. Comput. Chem., 23(16):1587–1600, 2002.

[293] G. Cui, B. Wang, and K. M. Merz Jr. Computational studies of the farnesyltransferase ternary complex - Part I: Substrate binding. Biochemistry, 44(50):16513–16523, 2005.

[294] J. R. Collins, D. L. Camper, and G. H. Loew. Valproic Acid Metabolism by Cytochrome-P450 - a Theoretical-Study of Stereoelectronic Modulators of Product Distribution. J. Am. Chem. Soc., 113(7):2736–2743, 1991.

[295] J. R. Collins, P. Du, and G. H. Loew. Molecular-Dynamics Simulations of the Resting and Hydrogen Peroxide-Bound States of Cytochrome-C Peroxidase. Biochemistry, 31(45):11166–11174, 1992.

[296] S. J. Yao, J. P. Plastaras, and L. G. Marzilli. A Molecular Mechanics Amber-Type Force-Field for Modeling Platinum Complexes of Guanine Derivatives. Inorg. Chem., 33(26):6061–6077, 1994.

[297] M. M. Harding. The geometry of metal-ligand interactions relevant to proteins. Acta Crystallogr., Sect. D: Biol. Crystallogr., 55:1432–43, 1999.

[298] M. M. Harding. The geometry of metal-ligand interactions relevant to proteins. II. Angles at the metal atom, additional weak metal-donor interactions. Acta Crystallogr., Sect. D: Biol. Crystallogr., 56:857–67, 2000.

[299] M. M. Harding. Geometry of metal-ligand interactions in proteins. Acta Crystallogr., Sect. D: Biol. Crystallogr., 57:401–11, 2001.

[300] M. M. Harding. Metal-ligand geometry relevant to proteins and in proteins: sodium and potassium. Acta Crystallogr., Sect. D: Biol. Crystallogr., 58:872–4, 2002.

[301] M. M. Harding. The architecture of metal coordination groups in proteins. Acta Crystallogr., Sect. D: Biol. Crystallogr., 60:849–59, 2004.

[302] M. M. Harding. Small revisions to predicted distances around metal sites in proteins. Acta Crystallogr., Sect. D: Biol. Crystallogr., 62:678–82, 2006.

[303] J. Aqvist. Ion Water Interaction Potentials Derived from Free-Energy Perturbation Simulations. J. Phys. Chem., 94(21):8021–8024, 1990.

[304] A. Bondi. van der Waals Volumes and Radii. J. Phys. Chem., 68(3):441–451, 1964.

[305] S. S. Batsanov. van der Waals radii of elements. Inorg. Mater., 37(9):871–885, 2001.

[306] S. S. Batsanov. The determination of van der Waals radii from the structural characteristics of metals. Russ. J. Phys. Chem., 74(7):1144–1147, 2000.

[307] D. Asthagiri, L. R. Pratt, M. E. Paulaitis, and S. B. Rempe. Hydration structure and free energy of biomolecularly specific aqueous dications, including Zn2+ and first transition row metals. J. Am. Chem. Soc., 126(4):1285–1289, 2004.

[308] C. S. Babu and C. Lim. Empirical force fields for biologically active divalent metal cations in water. J. Phys. Chem. A, 110(2):691–699, 2006.

[309] C. S. Babu and C. Lim. A new interpretation of the effective Born radius from simulation and experiment. Chem. Phys. Lett., 310(1-2):225–228, 1999.

[310] C. S. Babu and C. Lim. Theory of ionic hydration: Insights from molecular dynamics simulations and experiment. J. Phys. Chem. B, 103(37):7958–7968, 1999.

[311] A. C. Vaiana, A. Schulz, J. Wolfrum, M. Sauer, and J. C. Smith. Molecular mechanics force field parameterization of the fluorescent probe rhodamine 6G using automated frequency matching. J. Comput. Chem., 24(5):632–639, 2003.

[312] M. J. Frisch, G. W. Trucks, H. B. Schlegel, G. E. Scuseria, M. A. Robb, J. R. Cheeseman, J. A. Montgomery Jr., T. Vreven, K. N. Kudin, J. C. Burant, J. M. Millam, S. S. Iyengar, J. Tomasi, V. Barone, B. Mennucci, M. Cossi, G. Scalmani, N. Rega, G. A. Petersson, H. Nakatsuji, M. Hada, M. Ehara, K. Toyota, R. Fukuda, J. Hasegawa, M. Ishida, T. Nakajima, Y. Honda, O. Kitao, H. Nakai, M. Klene, X. Li, J. E. Knox, H. P. Hratchian, J. B. Cross, V. Bakken, C. Adamo, J. Jaramillo, R. Gomperts, R. E. Stratmann, O. Yazyev, A. J. Austin, R. Cammi, C. Pomelli, J. W. Ochterski, P. Y. Ayala, K. Morokuma, G. A. Voth, P. Salvador, J. J. Dannenberg, V. G. Zakrzewski, S. Dapprich, A. D. Daniels, M. C. Strain, O. Farkas, D. K. Malick, A. D. Rabuck, K. Raghavachari, J. B. Foresman, J. V. Ortiz, Q. Cui, A. G. Baboul, S. Clifford, J. Cioslowski, B. B. Stefanov, G. Liu, A. Liashenko, P. Piskorz, I. Komaromi, R. L. Martin, D. J. Fox, T. Keith, M. A. Al-Laham, C. Y. Peng, A. Nanayakkara, M. Challacombe, P. M. W. Gill, B. Johnson, W. Chen, M. W. Wong, C. Gonzalez, and J. A. Pople. Gaussian 03, Revision C.02. Gaussian, Inc., Wallingford, CT, 2004.

[313] B. H. Besler, K. M. Merz Jr., and P. A. Kollman. Atomic Charges Derived from Semiempirical Methods. J. Comput. Chem., 11(4):431–439, 1990.

[314] P. Cieplak, W. D. Cornell, C. Bayly, and P. A. Kollman. Application of the Multimolecule and Multiconformational RESP Methodology to Biopolymers - Charge Derivation for DNA, RNA, and Proteins. J. Comput. Chem., 16(11):1357–1377, 1995.

[315] A. D. Becke. Density-Functional Exchange-Energy Approximation with Correct Asymptotic-Behavior. Phys. Rev. A, 38(6):3098–3100, 1988.

[316] C. T. Lee, W. T. Yang, and R. G. Parr. Development of the Colle-Salvetti Correlation-Energy Formula into a Functional of the Electron-Density. Phys. Rev. B, 37(2):785–789, 1988.

[317] A. D. Becke. Density-Functional Thermochemistry. 3. The Role of Exact Exchange. J. Chem. Phys., 98(7):5648–5652, 1993.

[318] P. E. M. Siegbahn and T. Borowski. Modeling enzymatic reactions involving transition metals. Acc. Chem. Res., 39(10):729–738, 2006.

[319] A. Blondel and M. Karplus. New formulation for derivatives of torsion angles and improper torsion angles in molecular mechanics: Elimination of singularities. J. Comput. Chem., 17(9):1132–1141, 1996.

[320] W. C. Swope and D. M. Ferguson. Alternative expressions for energies and forces due to angle bending and torsional energy. J. Comput. Chem., 13(5):585–594, 1992.

[321] R. E. Tuzun, D. W. Noid, and B. G. Sumpter. Computation of internal coordinates, derivatives, and gradient expressions: Torsion and improper torsion. J. Comput. Chem., 21(7):553–561, 2000.

BIOGRAPHICAL SKETCH

Martin Barry Peters was born on April 3rd, 1980 in Tipperary, Republic of Ireland, to Martin and Mary Peters. He attended primary and secondary school in New Inn and Cashel, respectively. In June 2002 he received his B.A. Mod. degree in Computational Chemistry from Trinity College, University of Dublin (TCD). While at Trinity he worked under the supervision of Dr. Isabel Rozas; it was there that he was introduced to computational chemistry and to his future significant other, Jane Montague. Martin then enrolled in the PhD program at Penn State University (PSU) and worked with Prof. Kenneth M. Merz Jr. on the application of semiempirical quantum mechanics to structure-based drug design.

In August 2005, Martin received his second degree, an M.Sc. in chemistry, from PSU. In September 2005 he moved to the University of Florida (UF) and joined the Department of Chemistry and the Quantum Theory Project to continue his work with Prof. K. M. Merz Jr. in pursuit of a doctoral degree. In his final year as a graduate student he successfully applied for a Government of Ireland postdoctoral fellowship in science, engineering and technology (IRCSET). After graduating from UF, he joined Dr. David Lloyd's group at TCD as an IRCSET postdoctoral fellow.
