THE APPLICATION OF SEMIEMPIRICAL METHODS IN DRUG DESIGN
By
MARTIN B. PETERS
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
2007
© 2007 Martin B. Peters
For Jane
ACKNOWLEDGMENTS
Words cannot describe my Jane. She is everything I could ask for. She has stood by me even when I left Ireland to pursue my dream of getting my PhD. Thank you, honey, for your love, support and the sacrifices you have made for us. I thank my mother for always giving me tremendous support and for her words of wisdom and encouragement. I would also like to thank my two brothers, Patrick and Francis, and my two sisters, Marian and Deirdre, for all their encouragement and support. Kennie, thank you for giving me the opportunity to work with you; I have truly enjoyed the experience. I would like to express my gratitude to all Merz group members, especially Kaushik, Andrew, Ken, Kevin, and Duane, for their support and friendship. I would also like to acknowledge the effort of Mike Weaver, who helped by editing this dissertation.
TABLE OF CONTENTS
ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
ABSTRACT
CHAPTER
1 INTRODUCTION
2 THEORY AND METHODS
  2.1 Receptor-Ligand Binding Free Energy
  2.2 Computational Drug Design
  2.3 Molecular Mechanics
  2.4 Quantum Mechanics
  2.5 Ligand Based Drug Design
    2.5.1 3D-QSAR with QM descriptors
    2.5.2 Field-based Methods
    2.5.3 Spectroscopic 3-D QSAR
    2.5.4 Quantum QSAR and Molecular Quantum Similarity
  2.6 Receptor Based Drug Design
  2.7 Semiempirical Divide-And-Conquer Approach
  2.8 Pairwise Energy Decomposition (PWD)
  2.9 Quantum Mechanical Charge Models
  2.10 Comparative Binding Energy Analysis (COMBINE)
  2.11 SemiEmpirical Comparative Binding Energy Analysis (SE-COMBINE)
  2.12 Graph Theory
  2.13 Statistical Methods
  2.14 Metalloproteins
3 MODELING TOOL KIT++
  3.1 Introduction
  3.2 Overview
    3.2.1 Development
    3.2.2 Library Hierarchy
    3.2.3 Molecule Library
    3.2.4 Graph Library
    3.2.5 MM Library
    3.2.6 GA Library
    3.2.7 Statistics Library
    3.2.8 Molecular Fragment Library
    3.2.9 Parsers Library
  3.3 Hybridization, Bond Order and Formal Charge Perception
  3.4 Ring Perception
  3.5 Addition of Hydrogen Atoms to Molecules
  3.6 Conformational Sampling
  3.7 Substructure Searching/ Functionalize
  3.8 Clique Detection/ Maximum Common Pharmacophore
  3.9 Superimposition
  3.10 Conclusions
4 SEMIFLEXIBLE QUANTUM MECHANICAL ALIGNMENT OF DRUG-LIKE MOLECULES
  4.1 Introduction
  4.2 Implementation
    4.2.1 Ligand Conformational Searching
    4.2.2 Structural Alignment and Clique Detection
    4.2.3 Semiempirical Similarity Score
  4.3 Results and Discussion
    4.3.1 Data Set
    4.3.2 Carboxypeptidase A
    4.3.3 Glycogen Phosphorylase
    4.3.4 Immunoglobin
    4.3.5 Streptavidin
    4.3.6 Dihydrofolate Reductase
    4.3.7 Trypsin
    4.3.8 Estrogen Receptor
    4.3.9 Peroxisome Proliferator-Activated Receptor γ
    4.3.10 Human Carbonic Anhydrase II
    4.3.11 Thrombin
    4.3.12 Elastase
    4.3.13 Thermolysin
  4.4 Conclusions
5 METAL CLUSTER MOLECULAR MECHANICS PARAMETERIZATION
  5.1 Introduction
  5.2 Implementation
    5.2.1 Equilibrium Bond Lengths and Angles
    5.2.2 Force Constants
    5.2.3 Point Charges
  5.3 Zinc AMBER Force Field
    5.3.1 Protein Data Bank Survey of Zinc Containing Proteins
    5.3.2 Tetrahedral Zn Environment Force Field Parameterization
  5.4 Conclusions
6 CONCLUSIONS
APPENDIX
A ALGORITHMS
  A.1 Subgraph Isomorphism Algorithm
  A.2 Maximum Common Pharmacophore
B AMBER GRADIENTS
  B.1 Vector Math and Derivatives
  B.2 AMBER First Derivatives
    B.2.1 Bond
    B.2.2 Angle
    B.2.3 Dihedral
    B.2.4 Electrostatic
    B.2.5 van der Waals
C FRAGMENT LIBRARY
  C.1 Terminal Fragments
  C.2 Two Point Linker Fragments
  C.3 Three Point Linker Fragments
  C.4 Four Point Linker Fragments
  C.5 Five Point Linker Fragments
  C.6 Three Membered Ring Fragments
  C.7 Four Membered Ring Fragments
  C.8 Five Membered Ring Fragments
  C.9 Six Membered Ring Fragments
  C.10 Greater than Six Membered Ring Fragments
  C.11 Fused Ring Fragments
REFERENCES
BIOGRAPHICAL SKETCH
LIST OF TABLES
2-1 Correspondence between Graph Theory and Chemical Terminology.
3-1 Disulfide Bond Prediction Parameters.
3-2 Meng Atomic Covalent Radii.
3-3 Labute Algorithm Upper Bound Bond Conditions.
3-4 Labute Algorithm Atom Hybridization Assignment.
3-5 Labute Algorithm Lower Bound Single Bond Lengths.
3-6 Labute Algorithm Bond Weights.
3-7 Hydrogen Bond Lengths.
3-8 Hydrogen Bond Angles.
3-9 Hydrogen Bond Dihedrals.
3-10 Dihedral Angles Available based on Bond Type.
4-1 Compound Alignment Literature.
4-2 Protein-Ligand Data Set.
4-3 Statistics of CuTieP Performance.
4-4 Carboxypeptidase A Ligand Alignments.
4-5 Glycogen Phosphorylase Ligand Alignments.
4-6 Immunoglobin Ligand Alignments.
4-7 Streptavidin Ligand Alignments.
4-8 Dihydrofolate Reductase Ligand Alignments.
4-9 Trypsin Ligand Alignments.
4-10 Estrogen Receptor Ligand Alignments.
4-11 PPARγ Ligand Alignments.
4-12 40 Human Carbonic Anhydrase II Inhibitors.
4-13 Human Carbonic Anhydrase II Results.
4-14 Thrombin Ligand Alignments.
4-15 Elastase Ligand Alignments.
4-16 Thermolysin Ligand Alignments.
5-1 Metal Ions in the Protein Data Bank.
5-2 Published Metalloprotein Force Fields Using the Bonded Plus Electrostatics Model.
5-3 Metal-Donor Bond Target Lengths.
5-4 Ideal Angles Used to Calculate Root Mean Square Deviations for Tetrahedral, Square Planar, Trigonal Bipyramidal, Square Pyramid and Octahedral Geometries.
5-5 Tetrahedral Zinc Primary Ligating Residues.
5-6 Zn-CCCC Cluster Bond Lengths and Force Constants.
5-7 Zn-CCCC Cluster Angles and Force Constants.
5-8 Zn-CCCH Cluster Bond Lengths and Force Constants.
5-9 Zn-CCCH Cluster Angles and Force Constants.
5-10 Zn-CCCH Cluster Angles and Force Constants.
5-11 Zn-CCHH Cluster Bond Lengths and Force Constants.
5-12 Zn-CCHH Cluster Angles and Force Constants.
5-13 Zn-CHHH Cluster Bond Lengths and Force Constants.
5-14 Zn-CHHH Cluster Angles and Force Constants.
5-15 Zn-HHHH Cluster Bond Lengths and Force Constants.
5-16 Zn-HHHH Cluster Angles and Force Constants.
5-17 Cysteine Charges using ChgModA for the Zn-CCCC, -CCCH, -CCHH, and -CHHH Clusters.
5-18 Cysteine Charges using ChgModB for the Zn-CCCC, -CCCH, -CCHH, and -CHHH Clusters.
5-19 Histidine Charges using ChgModA for the Zn-CCCC, -CCCH, -CCHH, and -CHHH Clusters.
5-20 Histidine Charges using ChgModB for the Zn-CCCC, -CCCH, -CCHH, and -CHHH Clusters.
5-21 Zn-HHHO Cluster Bond Lengths and Force Constants.
5-22 Zn-HHHO Cluster Angles and Force Constants.
5-23 Zn-HHOO Cluster Bond Lengths and Force Constants.
5-24 Zn-HHOO Cluster Angles and Force Constants.
5-25 Zn-HOOO Cluster Bond Lengths and Force Constants.
5-26 Zn-HOOO Cluster Angles and Force Constants.
5-27 Histidine and Water’s Partial Charges using ChgModB for the Zn-HHHO, -HHOO, and -HOOO Clusters.
5-28 Zn-HHHD and Zn-HHDD Cluster Bond Lengths and Force Constants.
5-29 Zn-HHHD Cluster Angles and Force Constants.
5-30 Zn-HHDD Cluster Angles and Force Constants.
5-31 Histidine and Aspartate Residue Charges using ChgModB for the Zn-HHHD and -HHDD Clusters.
C-1 Terminal Fragments.
C-2 Two Point Linker Fragments.
C-3 Three Point Linker Fragments.
C-4 Four Point Linker Fragments.
C-5 Five Point Linker Fragments.
C-6 Three Membered Ring Fragments.
C-7 Four Membered Ring Fragments.
C-8 Five Membered Ring Fragments.
C-9 Six Membered Ring Fragments.
C-10 Greater than Six Membered Ring Fragments.
C-11 Fused Ring Fragments.
LIST OF FIGURES
2-1 Drug Development Process.
2-2 The Iterative Drug Design Process.
2-3 Thermodynamic Cycle of Receptor-Ligand Binding.
2-4 Computational Component of Drug Design.
2-5 Hierarchy of QM methods used in SBDD.
2-6 NMR QSAR.
2-7 The Classic “Pac-man” Representation of Receptor-Ligand Binding.
2-8 PWD Density Matrix Representation.
2-9 Schematic Diagram of the Human Carbonic Anhydrase II inhibitor Fragmentation.
2-10 SE-COMBINE Descriptor Table.
2-11 Schematic Diagram of a Trypsin Inhibitor Fragmentation.
2-12 SE-COMBINE Intermolecular Interaction Map (IMM).
2-13 Graph Theory I.
2-14 Graph Theory II.
2-15 Principal Component Analysis (PCA) Schematic Diagram of the Matrices and Vectors Involved.
2-16 Partial Least Squares (PLS) Schematic Diagram of the Matrices and Vectors Involved.
2-17 Most Common Amino Acid Residues which Bond to Metal Ions.
2-18 Zinc Metalloproteins.
2-19 Copper Metalloproteins.
2-20 Homo-Nuclear Metalloproteins.
2-21 Hetero-Nuclear Metalloproteins.
3-1 Computational Drug Design.
3-2 Library Hierarchy as Implemented in MTK++.
3-3 Core Class hierarchy of the Molecule Library as implemented in MTK++.
3-4 Class Hierarchy of the Parameters Component of the Molecule Class as Implemented in MTK++.
3-5 Class Hierarchy of the Standard Library Component of the Molecule Class as Implemented in MTK++.
3-6 Disulfide Bond in Proteins.
3-7 The Structural Types of the Histidine Residue.
3-8 Class Hierarchy of the Molecule Component of the Molecule Class as Implemented in MTK++.
3-9 Class Hierarchy of the Graph Library as Implemented in MTK++.
3-10 Class Hierarchy of the MM library as Implemented in MTK++.
3-11 Class Hierarchy of the GA Library as Implemented in MTK++.
3-12 Class Hierarchy of the Statistics Library as Implemented in MTK++.
3-13 Class Hierarchy of the Parsers Library as Implemented in MTK++.
3-14 Hybridization, Bond Order, and Formal Charge Perception Using the Labute Algorithm.
3-15 Ring Perception.
3-16 Ring Perception Contd.
3-17 Aromatic, Non-aromatic, and Anti-aromatic Rings.
3-18 Hydrogen Bond.
3-19 Rotatable Bond Types.
3-20 Systematic Conformational Searching.
3-21 Conformer Generation.
3-22 Ullman Subgraph Isomorphism Illustration.
3-23 Clique Detection Illustration.
3-24 Molecular Superposition.
4-1 Carboxypeptidase A Ligands.
4-2 1CBX Conformer Analysis.
4-3 Carboxypeptidase A Alignment Results.
4-4 Glycogen Phosphorylase Ligands.
4-5 Glycogen Phosphorylase Alignment Results.
4-6 Immunoglobin Ligands.
4-7 Immunoglobin Alignment Results.
4-8 Streptavidin Ligands.
4-9 Streptavidin Alignment Results.
4-10 Dihydrofolate Reductase Ligands.
4-11 Trypsin Inhibitors.
4-12 Trypsin Alignment Results.
4-13 Estrogen Receptor Ligands.
4-14 Peroxisome Proliferator-Activated Receptor γ Agonists.
4-15 HCA II Ligands.
4-16 Thrombin Inhibitors.
4-17 Elastase Ligands.
4-18 Elastase Alignment Results.
4-19 Thermolysin Inhibitors.
4-20 Thermolysin Alignment Results.
5-1 Approaches to Incorporate Metal Atoms into Molecular Mechanics Force Fields.
5-2 MCPB Flow Diagram.
5-3 Metal Ligand Geometries Perceived Using Harding’s Rules.
5-4 Zinc Coordination Geometry Distribution from the PDB.
5-5 The Most Common Tetrahedral Zinc Coordinating ligands Combination Distribution.
5-6 Zn-S Bond Length Distributions in CCCC, CCCH, CCHH, and CHHH Tetrahedral Environments.
5-7 Box Plots of Zn-S/N Bond Lengths in CCCC, CCCH, CCHH, CHHH, and HHHH environments.
5-8 Tetrahedral Zn-O(Asp/Glu) and Zn-N(His) Bond Length Distributions.
5-9 ZAFF Flow Diagram.
5-10 Zn-CCCC Cluster Models (PDB ID: 1A5T).
5-11 Zn-CCCH Cluster Models (PDB ID: 1A73 and 2GIV).
5-12 Zn-CCHH Cluster Models (PDB ID: 1A1F).
5-13 Zn-CHHH Cluster Models (PDB ID: 1CK7).
5-14 Zn-HHHH Cluster Models (PDB ID: 1PB0).
5-15 Correlation between Zn-S and Zn-N Bond Lengths and Calculated Force Constants through the Series CCCC, CCCH, CCHH, CHHH, and HHHH.
5-16 Zn-HHHO Cluster Models (PDB ID: 1CA2).
5-17 Zn-HHOO Cluster Models (PDB ID: 1VLI).
5-18 Zn-HOOO Cluster Models (PDB ID: 1L3F).
5-19 Zn-HHHD and Zn-HHDD Cluster Models (PDB ID: 2USN and 1U0A).
LIST OF ABBREVIATIONS
PDB Protein Data Bank
DD Drug Design
NDA New Drug Application
IND Investigational New Drug
FDA Food and Drug Administration
ADME Absorption, Distribution, Metabolism, and Excretion
SBDD Structure-Based Drug Design
LBDD Ligand-Based Drug Design
MM Molecular Mechanics
QM Quantum Mechanics
HF Hartree-Fock
DFT Density Functional Theory
SE SemiEmpirical
MNDO Modified Neglect of Differential Overlap
AM1 Austin Model 1
PM3 Parametric Model 3
PDDG/PM3 Pairwise Distance Directed Gaussian modification of PM3
SCC-DFTB Self-Consistent-Charge Density-Functional Tight-Binding
RBDD Receptor-Based Drug Design
QSAR Quantitative Structure Activity Relationship
MLR Multiple Linear Regression
PCR Principal Component Regression
PLSR Partial Least Squares Regression
CNNs Computer Neural Networks
HOMO Highest Occupied Molecular Orbital
LUMO Lowest Unoccupied Molecular Orbital
CODESSA COmprehensive DEscriptors for Structural and Statistical Analysis
CoMFA Comparative Molecular Field Analysis
CoMSIA Comparative Molecular Similarity Indices Analysis
PLS Partial Least Squares
PIE Probe Interaction Energy
QSM Quantum Similarity Measure
CSI Carbó Similarity Index
QQSAR Quantum QSAR
QSSA Quantum Similarity Superposition Algorithm
QTMS Quantum Topological Molecular Similarity
BCPs Bond Critical Points
AIM Atoms-In-Molecules
DnC Divide-and-Conquer
NDDO Neglect of Diatomic Differential Overlap
SASA Solvent Accessible Surface Area
PWD Pairwise Energy Decomposition
CNDO Complete Neglect of Differential Overlap
CM1 Charge Model 1
CM2 Charge Model 2
RESP Restrained ElectroStatic Potential
MK Merz-Singh-Kollman
COMBINE Comparative Binding Energy Analysis
SE-COMBINE SemiEmpirical-Comparative Binding Energy Analysis
IMM InterMolecular interaction Map
LOO Leave-One-Out
PRESS Predicted Residual Sum of Squares
SDEC Standard Deviation of Error of Calculations
SDEP Standard Deviation of Error Prediction
RMSD Root Mean Squared Deviation
PCA Principal Component Analysis
PC Principal Component
CYS Cysteine
MET Methionine
ASP Aspartic Acid
GLU Glutamic Acid
HIS Histidine
HCAII Human Carbonic Anhydrase II
MTK++ Modeling ToolKit++
API Application Programming Interface
GA Genetic Algorithm
BLAS Basic Linear Algebra Subprograms
LAPACK Linear Algebra PACKage
GAFF Generalized AMBER Force Field
MEP Molecular Electrostatic Potential
vdW van der Waals
GFs Gaussian Functions
RFO Rational Function Optimization
RIPS Random Incremental Pulse Search
BFGS Broyden-Fletcher-Goldfarb-Shanno
SD Steepest Descent
NR Newton-Raphson
MCP Maximum Common Pharmacophore
ASA Atomic Shell Approximation
SA Surface Area
MO Molecular Orbital
DHFR Dihydrofolate Reductase
PPARγ Peroxisome Proliferator-Activated Receptor γ
ER Estrogen Receptor
ESP ElectroStatic Potential
UFF Universal Force Field
CCSD Crystallographic Structural Database
MCPB Metal Center Parameter Builder
Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy
THE APPLICATION OF SEMIEMPIRICAL METHODS IN DRUG DESIGN
By
Martin B. Peters
August 2007
Chair: Kenneth M. Merz Jr.
Major: Chemistry
The application of quantum mechanical methods in de novo drug design is currently quite limited in both scope and utility. This thesis outlines where these methods fit in this process and where they can be improved. Chapters one and two of this dissertation describe the drug development process and current methods used to calculate the free energy of receptor-ligand binding. Some of the computational tools used in drug design are discussed, such as scoring functions, molecular mechanics, quantum mechanics, semiempirical pair-wise energy decomposition, comparative binding energy analysis, the SE-COMBINE approach and popular 3D-QSAR approaches.
The remaining chapters of this work describe the development and application of a package of computational chemistry C++ libraries called the Modeling ToolKit++ (MTK++). This toolkit was used to develop a new technique to superimpose drug-like molecules onto one another using a quantum mechanical score function. Obtaining the correct alignment of two molecules to reproduce the pose within a protein active site is a challenging problem. This new method was validated on almost 90 protein-ligand complexes for which x-ray crystallographic data was available.
MTK++ was also used to develop a generalized tetrahedral Zinc force field for metalloprotein molecular dynamics simulations. It is desirable to model metalloprotein systems using MM models because one can carry out simulations to address important structure/function and dynamics questions that are not currently attainable using QM and QM/MM based methods. Until now, force fields for metalloproteins were built by hand through a convoluted process. The creation of a computer program to do this removes the human error factor. This program was used to build force fields for 10 Zinc tetrahedral active sites. This required the parameterization of bond and angle force constants and the calculation of partial charges. MTK++ was designed to automatically perceive metal centers and assign the parameters necessary to carry out MM or MD calculations.
CHAPTER 1
INTRODUCTION
Drug discovery has evolved from being serendipitous to a rather rational process of design. High-throughput screening, combinatorial chemistry, the human genome project, and computational methods have been developed to this end. Nonetheless, the cost of creating a drug has increased exponentially over the last 50 years [1] without a corresponding increase in the number of new drugs reaching the market. The most plausible reason is a lack of fundamental understanding of molecular recognition, binding and ultimately drug
delivery processes. Computational medicinal chemistry spans a broad spectrum of disciplines including theoretical, computational, and structural chemistries. Theoretical chemistry involves the development of new and improved theories whereas computational chemistry entails the
application of established theoretical tools to chemical problems. Structural chemistry techniques such as X-ray crystallography and NMR spectroscopy have played a significant role in facilitating our understanding of molecular recognition and interaction. Although computational medicinal chemistry cannot design new drugs on its own, it has been shown that it can play a role in predicting binding free energies and geometries of receptor-ligand
complexes. Examples include the development of the HIV protease inhibitor, saquinavir, by “in silico” design as a transition-state analogue [2] and the rational design of an Angiotensin-Converting Enzyme (ACE) inhibitor called captopril [3]. The computational techniques employed to aid the drug design process include virtual
screening, docking, and scoring with the results or “hits” utilized by medicinal chemists [4]. Computational methods vary in cost; screening can be carried out on large databases of compounds, while scoring and docking are generally carried out on a smaller number of structures. Screening attempts to predict physicochemical properties of molecules such as aqueous solubility and by doing so reduces the number of molecules with poor drug-like properties being synthesized. Docking is a technique of placing a drug candidate
into the active site of a receptor. The docked pose of a ligand in the active site of a receptor can be scored using knowledge-, empirical-, or physics-based methods, with the latter being more expensive [5]. Computational methods also lend themselves to virtual combinatorial chemistry, which can be used to optimize the complementarity between a receptor and a ligand. Although this can also be done experimentally, the main advantage of the computational approach is the reduction of cost and time.
The computational prediction of binding free energies is still not an exact science. However, utilizing current computer hardware and theoretical technologies, problems that were hitherto reputedly unfeasible are now tractable. There are two areas where increased computational power can be used for the accurate prediction of binding free energies: increased sampling of conformational space, and interaction energy calculations using complete Hamiltonians [6]. The use of both will be investigated in this thesis.
This dissertation describes the application of quantum mechanical methods in bio- and medicinal chemistry. The following chapters describe the development of computational chemistry modeling software, the flexible alignment of drug-like molecules, and the generation of a Zinc force field for metalloprotein simulations and drug design applications.
In Chapter 2 the industrial drug design process is outlined as an overview of why computational tools are used in drug design. The thermodynamic basis and current methods used to calculate the free energy of receptor-ligand binding are described. The equations of binding are derived in order to reflect the current understanding of binding, both experimentally and computationally. Some of the computational tools used in
Structure Based Drug Design (SBDD) are discussed, including scoring functions [4], Molecular Mechanics (MM), Quantum Mechanics (QM), SemiEmpirical (SE) Pair-Wise energy Decomposition (PWD) [7], the Comparative Binding Energy Analysis (COMBINE) [8], and the SE-COMBINE [9] approaches. Popular 3D-QSAR approaches are also outlined, including CoMFA (Comparative Molecular Field Analysis) [10] and CoMSIA (Comparative Molecular Similarity Indices Analysis) [11, 12], along with two multivariate statistical tools: Principal Component Regression (PCR) [13] and Partial Least Squares (PLS) [14].
Chapter 3 outlines the design and development of the Modeling ToolKit++
(MTK++) package of C++ libraries for the use of QM methods in drug design. The algorithms such as atomic hybridization and formal charge determination, bond order and ring perception, substructure searching and clique detection are described in detail with numerous illustrations. The impetus for this work was to create a computational chemistry platform where QM methods could be conveniently incorporated in drug
discovery applications. This work was fundamental to this thesis and all modeling in later chapters used this package. The fourth Chapter describes a method to flexibly align drug-like molecules onto one another using a semiempirical scoring function. The alignment of two bodies is a
mathematical problem; however, the challenge is to reproduce the pose seen in x-ray crystallographic studies. Traditionally, molecular superposition has been carried out using empirical scoring functions due to their speed. The goal of this research was to investigate the applicability of semiempirical methods in molecular alignment, and their ability to do so was validated against over 80 protein-ligand complexes from the Protein Data Bank (PDB) [15].
The fifth Chapter outlines the development of a molecular mechanics force field (FF) for tetrahedral Zinc metalloproteins suitable for the AMBER suite of programs [16]. Several issues regarding the modeling of metalloproteins were addressed. The first goal
was to develop software to conveniently handle metalloprotein structures. The program MCPB (Metal Center Parameter Builder) was created to build and validate metalloprotein FFs for use in molecular simulations to study structure, function, and dynamics. Secondly, the automated perception of metal centers in proteins was undertaken and gave rise to
the program called pdbSearcher. This software was used to survey the PDB for all Zinc
containing metalloproteins. The most abundant primary shell combinations bound to Zn atoms were extracted and FFs were generated, with the resulting parameters analyzed in detail.
Finally, Chapter 6 provides a brief summary of the work presented in this dissertation.
I hope this dissertation demonstrates the utility of current quantum mechanical approaches in the areas of drug design and metalloprotein modeling. The use of quantum mechanical methods in drug design can be viewed as the final frontier due to the fact that these methods describe molecular interactions from first principles [6]. Nevertheless, the use of quantum mechanics over classical approaches brings extra expense and so it is necessary to show that these methods can be superior to simpler models.
CHAPTER 2
THEORY AND METHODS
Designing a drug (a molecule which affects biological processes without causing injury) requires numerous steps from its inception to its introduction into the market.
This process takes approximately 10-15 years, as shown in Fig. 2-1, and can cost on the order of half a billion dollars. This is due to both the vastness of chemical space [17–21] and the cost of research and testing [1].
[Figure 2-1: timeline diagram. Stages: Basic Research, Preclinical Development, IND Submission, Clinical Development, NDA Submission. Activities: Compound Discovery, Safety Testing, IND Preparation, Phases I, II, III, NDA Preparation, FDA Review, alongside Formulation Research, Process Development, Pharmacokinetics, and Toxicology. Durations shown: 3-4 years, 1 year, 6-8 years, 2-3 years, ongoing. Compound counts shown: millions, 1000, 10, 1.]
Figure 2-1. Drug Development Process. Adapted from http://www.netsci.org/. An IND (Investigational New Drug) is prepared and submitted to the FDA (Food and Drug Administration) at the end of the preclinical phase of drug development. With good results from the clinical phase an NDA (New Drug Application) is submitted to the FDA for approval to release the drug to the general public.
The pre-clinical phase of Drug Design (DD) is carried out using an iterative process (first three columns of Fig. 2-1). It starts with some knowledge of a target, i.e., a known natural substrate or a crystal/NMR structure of the target or receptor. The target is chosen based on some known chemical feature of a biological disease. The design cycle takes many steps such as computational design, ligand design, synthesis, biochemical
evaluation, and crystallography converging to a drug candidate or lead as shown in Fig. 2-2 [22]. During each cycle of this process different computational tools are used with varying costs and accuracies which will be discussed in more detail later in this chapter. At the end of the pre-clinical phase an IND (Investigational New Drug) is prepared to
allow a company to test the drug in humans. After the IND is approved by the FDA (Food and Drug Administration) the clinical stage (4th column of Fig. 2-1) begins. Phase I of the clinical trials tests the toxicity, pharmacokinetics or ADME (Absorption, Distribution, Metabolism, and Excretion) properties, and dosage on approximately 50 healthy volunteers. Phase II evaluates the drug's effectiveness and side effects on volunteer patients (≈500), and it is at this stage that most adverse effects of the drug's use are observed. The final phase, Phase III, of clinical trials determines the effects of long term use on a large pool of volunteer patients. After phase III a company prepares an NDA (New Drug Application) and submits it to the FDA for approval to release the drug to the general public. The NDA contains results of all clinical studies and once approved by the FDA the drug can be marketed. After release the company carries out post-marketing surveillance of the drug's effectiveness in a so-called phase IV.
[Figure 2-2: cycle diagram with the labels Target Information, Computational Ligand Design, Synthesis, Biochemical Testing, Crystallographic Analysis, and Drug Lead.]
Figure 2-2. The Iterative Drug Design Process. Adapted from Babine and Bender [22].
Drug targets or receptors include enzymes, ion channels, nuclear hormone receptors, and DNA, which interact with endogenous physiological substances such as hormones and neurotransmitters. There are currently over 1200 drugs approved by the FDA for therapeutic use in the United States, 25% of which target enzymes [23]. The majority of enzyme-targeted drugs are enzyme-substrate based and most act via non-covalent interactions. Drugs that mimic the effects of endogenous regulatory compounds are called agonists, while compounds that do not have 100% activity are termed partial agonists [24]. Drugs that bind to receptors but have no activity and prevent endogenous compounds from binding are termed antagonists or inhibitors.
There are two main types of enzymatic inhibition, reversible and irreversible. Reversible inhibition occurs through competitive, noncompetitive and uncompetitive mechanisms. Diuretics used to control blood pressure and many anti-depressive agents, for example antagonists of dopamine receptors, are reversible competitive inhibitors. These drugs compete for the same binding site as the natural substrate, but the enzyme cannot process the inhibitor, thus preventing catalytic activity. Non-competitive or allosteric inhibitors bind to different regions of the enzyme and do not compete for the binding site. However, the process of binding the inhibitor can change the shape of the active site, thus preventing catalytic activity. Uncompetitive inhibition takes place when the inhibitor only binds the enzyme-substrate complex, consequently preventing catalysis. Irreversible inhibition occurs when the inhibitor covalently attaches to the enzyme active site such as inhibitors of Carbonic Anhydrase [25].
Structural determination of receptors or complexes is often carried out by x-ray crystallography.
It should be noted that the atomic positions from crystallography have an associated error, which is generally on the order of 1/6 of the resolution (≈0.4 Å uncertainty for a 2.4 Å resolution structure) [22].
A fundamental understanding of the interactions between receptors and ligands is necessary for the design of new drugs. These forces include ionic or electrostatic effects, ion-dipole and dipole-dipole interactions, charge transfer, van der Waals, and hydrophobic interactions. Molecules with high biological activity usually possess a shape that is complementary (hydrophobic, electrostatic, and polar contacts are paired upon binding) to that of the receptor's active site, as first proposed by Fischer (“lock-and-key” hypothesis).
2.1 Receptor-Ligand Binding Free Energy
In the simplest case, receptor-ligand binding corresponds to a single ligand molecule forming a 1:1 complex with a receptor that contains only a single binding site, as shown in Eq. 2–1. R represents the receptor, L the ligand, and R·L the complex, where $k_{1}$ and $k_{-1}$ are the association and dissociation rate constants, respectively.
\[ R + L \underset{k_{-1}}{\overset{k_{1}}{\rightleftharpoons}} R \cdot L \tag{2–1} \]
At equilibrium, association of a receptor and ligand occurs at the same rate as dissociation and the equilibrium constants, Ka and Kd, can be defined as:
\[ K_a = \frac{[RL]}{[R][L]} = \frac{1}{K_d} \tag{2–2} \]
It is a common practice to use Kd for practical reasons as it has units of concentration. Kd is the concentration of free ligand at which half of the receptor binding sites at equilibrium are occupied. Small values of Kd correspond to a high affinity between the receptor and ligand. To gain a fundamental understanding of receptor-ligand binding one must begin with a thermodynamic description. The Gibbs free energy is most often used in biochemistry as
binding experiments are carried out under conditions of constant temperature, pressure, and number of particles. ∆G, Eq. 2–3 is the free energy change for the reaction, ∆H and ∆S are the enthalpy and entropy changes respectively, and T is the temperature.
\[ \Delta G = \Delta H - T\Delta S \tag{2–3} \]
The change in free energy can be expressed in terms of the equilibrium Kd as follows:
\[ \Delta G = \Delta G^{\circ} - RT \ln K_d \tag{2–4} \]
where $\Delta G^{\circ}$ is the standard-state (1 M, 1 bar) free energy change and R is the gas constant. When complex association and dissociation reach equilibrium, $\Delta G = 0$ and the expression takes the form:
\[ \Delta G^{\circ} = RT \ln K_d \tag{2–5} \]
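As a rough numerical illustration of Eq. 2–5, the sketch below converts a dissociation constant into a standard binding free energy and shows how a free energy difference maps onto a ratio of dissociation constants. The function names, the default temperature of 298.15 K, and the occupancy helper (the usual 1:1 binding isotherm, [L]/(Kd + [L])) are assumptions made for this example and are not taken from the dissertation.

```python
import math

# Gas constant in kcal/(mol K); the unit choice is an assumption of this sketch.
R_KCAL = 1.987204e-3


def delta_g_standard(kd_molar, temperature=298.15):
    """Standard binding free energy in kcal/mol from Eq. 2-5: dG = RT ln(Kd).

    Kd is given in mol/L and is implicitly divided by the 1 M standard state.
    """
    return R_KCAL * temperature * math.log(kd_molar)


def fraction_bound(free_ligand_molar, kd_molar):
    """Fraction of occupied sites for 1:1 binding: [L] / (Kd + [L])."""
    return free_ligand_molar / (kd_molar + free_ligand_molar)


def kd_ratio(delta_delta_g, temperature=298.15):
    """Ratio of two Kd values whose binding free energies differ by delta_delta_g."""
    return math.exp(delta_delta_g / (R_KCAL * temperature))


if __name__ == "__main__":
    # A 1 nM binder corresponds to roughly -12.3 kcal/mol at 298.15 K.
    print(round(delta_g_standard(1e-9), 2))
    # At [L] = Kd exactly half of the binding sites are occupied.
    print(fraction_bound(1e-9, 1e-9))
    # Orders of magnitude in Kd spanned by a 5 kcal/mol difference.
    print(round(math.log10(kd_ratio(5.0)), 2))
```

Because the relation is logarithmic, each factor of ten in Kd costs about 1.36 kcal/mol at room temperature, which is why modest energy errors translate into large affinity errors.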
Since free energy is a state function it can be calculated and compared with experimental values. The free energy of binding, $\Delta G_{bind}$, is calculated by determining the free energy of reactants, $(\Delta G_R + \Delta G_L)$, and products, $\Delta G_{RL}$, separately. The superscript “°” is dropped from the remaining equations for simplicity; however, it is implied.
\[ \Delta G_{bind} = \Delta G_{RL} - (\Delta G_R + \Delta G_L) \tag{2–6} \]
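\[
\begin{array}{ccccc}
R_{gas} & + & L_{gas} & \xrightarrow{\;\Delta G_{bind}^{gas}\;} & RL_{gas} \\
\downarrow \Delta G_{solv}^{R} & & \downarrow \Delta G_{solv}^{L} & & \downarrow \Delta G_{solv}^{RL} \\
R_{solv} & + & L_{solv} & \xrightarrow{\;\Delta G_{bind}^{solv}\;} & RL_{solv}
\end{array}
\]
Figure 2-3. Thermodynamic Cycle of Receptor-Ligand Binding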
Using the thermodynamic cycle in Fig. 2-3 and Eq. 2–6, the free energy of binding in solution, $\Delta G_{bind}^{solv}$, can be fully decomposed as in Eq. 2–7 [26]. $\Delta G_{bind}^{gas}$ is the free energy of complexation in the gas phase. This term is dominated by the enthalpic contributions from steric and electrostatic interactions. $\Delta\Delta G_{solv}$ is the solvation free energy of complexation, which incorporates the desolvation of the ligand, $\Delta G_{solv}^{L}$, receptor, $\Delta G_{solv}^{R}$, and complex, $\Delta G_{solv}^{RL}$.
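\[ \Delta G_{bind}^{solv} = \Delta G_{bind}^{gas} + \Delta\Delta G_{solv} \tag{2–7} \]
where $\Delta G_{bind}^{gas}$ and $\Delta\Delta G_{solv}$ are defined by equations 2–8 and 2–9.
\[ \Delta G_{bind}^{gas} = \Delta H_{bind}^{gas} - T\Delta S_{bind}^{gas} \tag{2–8} \]
\[ \Delta\Delta G_{solv} = \Delta G_{solv}^{RL} - \Delta G_{solv}^{R} - \Delta G_{solv}^{L} \tag{2–9} \]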
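The bookkeeping of Eqs. 2–7 to 2–9 can be sketched in a few lines of code. The container class, its field names, and the numbers in the usage example are hypothetical and serve only to show how the terms combine; they are not results from this work.

```python
from dataclasses import dataclass


@dataclass
class BindingComponents:
    """Hypothetical container for the terms of the thermodynamic cycle."""
    dh_gas: float           # gas-phase binding enthalpy, kcal/mol
    ds_gas: float           # gas-phase binding entropy, kcal/(mol K)
    g_solv_complex: float   # solvation free energy of the complex, kcal/mol
    g_solv_receptor: float  # solvation free energy of the receptor, kcal/mol
    g_solv_ligand: float    # solvation free energy of the ligand, kcal/mol


def dg_bind_gas(c, temperature=298.15):
    """Eq. 2-8: gas-phase binding free energy."""
    return c.dh_gas - temperature * c.ds_gas


def ddg_solv(c):
    """Eq. 2-9: solvation free energy of complexation."""
    return c.g_solv_complex - c.g_solv_receptor - c.g_solv_ligand


def dg_bind_solv(c, temperature=298.15):
    """Eq. 2-7: binding free energy in solution."""
    return dg_bind_gas(c, temperature) + ddg_solv(c)


if __name__ == "__main__":
    # Illustrative numbers only: a strongly favorable gas-phase interaction
    # that is largely cancelled by the desolvation penalty.
    example = BindingComponents(
        dh_gas=-60.0, ds_gas=-0.05,
        g_solv_complex=-400.0, g_solv_receptor=-380.0, g_solv_ligand=-55.0,
    )
    print(dg_bind_gas(example, 300.0), ddg_solv(example), dg_bind_solv(example, 300.0))
```

The example is chosen so that a large gas-phase term and a large desolvation term nearly cancel, which illustrates why each component has to be computed accurately.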
For tight-binding ligands the interactions in the complex are significantly stronger than those of the receptor and ligand alone in solution. In addition, the favorable enthalpic interactions must compensate for the entropic loss of conformational degrees of freedom of both the receptor and ligand, plus the loss of three rotational and three translational degrees of freedom. It should be noted that small variations in a complex's stability ($\Delta G$) in kcal/mol correspond to large differences in affinity ($K_d$). For example, at room temperature a difference of 5 kcal/mol corresponds to between three and four orders of magnitude variation in observed affinity.
2.2 Computational Drug Design
The computational components of the drug design process take place during the initial stages of each iterative cycle, as shown in Fig. 2-2, and the main reasons for their use are to reduce costs and to provide atomic-level insight into receptor-ligand interactions. This area can be broken down into two parts: Structure-Based Drug Design and Ligand-Based Drug Design. The former requires structural knowledge of the receptor while the latter does not; both are discussed in detail below. The early iterations of the drug design process involve the searching or screening of databases of molecules such as ZINC [27, 28] and other combinatorial libraries [29] for compounds which may be active against the target [30–32], thus separating drugs from non-drugs [33–35]. Screening can involve similarity/dissimilarity searching [36] against a known active/inactive molecule. Compounds can be compared to each other in 1D [37], 2D [38], or 3D [39, 40], with the last technique being the most expensive. Simple counting techniques [41] such as Lipinski's “rule-of-five” [42] are also used to filter out non-drug molecules. Screens are used to predict ADME properties [43] such as aqueous solubility [44], hepatotoxicity [45], P450 inhibition [46], and absorption [47]. Screens are also carried out to predict the synthetic accessibility of compounds, thus allowing for later functional group optimization [48]. Subsequent “hits” from a screen serve as lead compounds for medicinal chemists. De novo drug design [49–51] is another tool used to identify novel lead compounds. This technique “grows” molecules in the active site of a receptor, or in a pseudoreceptor derived from the alignment of known active molecules [52].
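The simple counting filters mentioned above are cheap enough to apply to millions of compounds. A minimal sketch of a rule-of-five check is given below, assuming the four descriptors have already been computed by some cheminformatics package; the function names and the descriptor values in the usage example are illustrative assumptions, and the common convention of tolerating at most one violation is used.

```python
def rule_of_five_violations(mol_weight, log_p, h_bond_donors, h_bond_acceptors):
    """Count violations of Lipinski's rule-of-five.

    The descriptors are assumed to be precomputed elsewhere; this function
    only applies the four thresholds.
    """
    checks = [
        mol_weight > 500.0,     # molecular weight above 500 Da
        log_p > 5.0,            # octanol-water partition coefficient above 5
        h_bond_donors > 5,      # more than 5 hydrogen-bond donors
        h_bond_acceptors > 10,  # more than 10 hydrogen-bond acceptors
    ]
    return sum(checks)


def passes_rule_of_five(mol_weight, log_p, h_bond_donors, h_bond_acceptors,
                        max_violations=1):
    """Return True if the compound has no more than max_violations violations."""
    n = rule_of_five_violations(mol_weight, log_p, h_bond_donors, h_bond_acceptors)
    return n <= max_violations


if __name__ == "__main__":
    # Illustrative descriptor values, not measured data.
    small_druglike = dict(mol_weight=180.2, log_p=1.2, h_bond_donors=1, h_bond_acceptors=4)
    large_polar = dict(mol_weight=800.0, log_p=6.5, h_bond_donors=7, h_bond_acceptors=12)
    for name, desc in [("small", small_druglike), ("large", large_polar)]:
        print(name, rule_of_five_violations(**desc), passes_rule_of_five(**desc))
```

Keeping the descriptor calculation separate from the filter makes it easy to swap in different descriptor sources or thresholds.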
[Figure 2-4 diagram, a funnel read from top to bottom:
Target
million compounds | Database Screening | milliseconds
1000s of compounds | Docking, Scoring | seconds/hours
100s of compounds | Lead Optimization | seconds/hours
Drug Candidate]
Figure 2-4. Computational Component of Drug Design. Timings are per compound.
Lead, or “drug-like”, compounds are expected to have good pharmacokinetics and be accessible to synthetic modification. The transition from a lead compound to a drug candidate involves optimizing structural and chemical complementarity with the receptor. Docking and scoring are tools to measure the complementarity between lead and receptor [4]. Docking is the process by which a ligand structure is placed in the active site of the receptor, while scoring predicts the binding free energy of complex formation. Lead optimization is often used to optimize the pharmacokinetics through functional group substitution. A schematic of the computational aspect of drug design is shown in Fig. 2-4. It is drawn as a funnel to highlight that the number of compounds decreases from top to bottom, while the expense of the computational tools used usually increases.
Various approaches have emerged to calculate or predict the binding free energy, with varying degrees of success. They include physics-, empirical-, and knowledge-based scoring functions [5, 53–57], and various QSAR approaches [10, 11]. The results of empirical and knowledge-based scoring functions are highly dependent on parameterization, and the calculation of binding free energies of compounds unlike those in the training set can yield spurious results. Physics-based scoring functions try to model each component of Eq. 2–7 from first principles. Physics-based techniques and QSAR approaches are introduced in the following sections and their advantages and disadvantages in determining the free energy of binding are discussed.
2.3 Molecular Mechanics
Molecular Mechanics (MM) force fields such as AMBER [16, 58–60], CHARMM [61], MMFF [62–69], OPLS [70], and MM3 [71] can be used to calculate the enthalpic component of the binding free energy between the receptor and ligand. The AMBER energy function, Eq. 2–10, contains bond, angle, dihedral, and non-bonded terms. The bond and angle terms are represented by harmonic expressions. The van der Waals term is a 6-12 potential, and the electrostatic term is expressed as a Coulombic interaction with atom-centered point charges.
\[ E_{total} = \sum_{bonds} K_r (r - r_{eq})^2 + \sum_{angles} K_\theta (\theta - \theta_{eq})^2 + \sum_{dihedrals} \frac{V_n}{2}\left[1 + \cos(n\phi - \gamma)\right] + \sum_{i<j} \left[ \frac{A_{ij}}{r_{ij}^{12}} - \frac{B_{ij}}{r_{ij}^{6}} + \frac{q_i q_j}{\varepsilon r_{ij}} \right] \tag{2–10} \]
A truncated Fourier series represents the dihedral term, where $V_n$ is the barrier height, $n$ is the periodicity, $\phi$ is the calculated dihedral angle, and $\gamma$ is the phase difference. The fourth term describes the steric interaction as a Lennard-Jones potential, where $r_{ij}$ is the distance between atoms $i$ and $j$. $A_{ij} = \varepsilon_{ij} r_{ij}^{*12}$ and $B_{ij} = 2\varepsilon_{ij} r_{ij}^{*6}$ are parameters that define the shape of the potential, where $r_{ij}^{*} = r_i^{*} + r_j^{*}$ in Å, $r_i^{*}$ is the van der Waals radius for atom $i$, $\varepsilon_{ij} = \sqrt{\varepsilon_i \varepsilon_j}$, $\varepsilon_i$ is the van der Waals well depth in kcal/mol, and $q$ are the atom-centered point charges. A rigorous derivation of the gradients of the AMBER function is given in Appendix B.
2.4 Quantum Mechanics
Higher order molecular interactions such as polarization and charge transfer are neglected in molecular mechanics force fields due to their point-charge-based approaches. Quantum mechanical techniques intrinsically include such interactions. The high computational cost of ab initio methods such as Hartree-Fock (HF) and Density Functional Theory (DFT) restricts their use to small systems such as organic molecules, protein active sites, and metal clusters. Thanks to the work of Pople, Dewar, and Stewart, amongst others, the Roothaan-Hall equations have been approximated and parameterized to give a series of so-called SemiEmpirical (SE) methods. The most commonly used SE methods are derived from the MNDO (Modified Neglect of Diatomic Overlap) [72] method, including AM1 (Austin Model 1) [73], PM3 (Parametric Method 3) [74, 75], MNDO/d (MNDO with d orbitals) [76], and PDDG/PM3 (Pairwise Distance Directed Gaussian modification of PM3) [77]. Recently, DFT methods have been approximated, creating the SCC-DFTB (Self-Consistent-Charge Density-Functional Tight-Binding) method [78, 79].
The SCC-DFTB approach has been compared to the traditional SE methods AM1 and PM3 and gave comparable errors in predicting heats of formation for a set of 622 neutral molecules; however, its errors were higher than those of the PDDG/PM3 method [80]. SE methods can be used to calculate the total electrostatic energy of a molecular system, which is the sum of the electronic energy, $E_{el}$, and core-core repulsion, $E_{core-core}$:
\[ E_{tot} = E_{el} + E_{core-core} \tag{2–11} \]
where $E_{el}$ and $E_{core-core}$ are described in equations 2–12 and 2–13. In these equations H is the one-electron matrix, F is the Fock matrix, and P represents the density matrix. Z is the nuclear charge on the atom, $R_{AB}$ is the atomic separation between A and B, and N is the total number of atoms.
\[ E_{el} = \frac{1}{2} \sum_{\mu} \sum_{\nu} \left( H_{\mu\nu} + F_{\mu\nu} \right) P_{\mu\nu} \tag{2–12} \]
\[ E_{core-core} = \sum_{A=1}^{N} \sum_{B>A}^{N} \frac{Z_A Z_B}{R_{AB}} \tag{2–13} \]
The use of QM in SBDD can be divided into two broad categories, receptor-based and ligand-based methods (Fig. 2-5). Receptor-Based Drug Design (RBDD) methods include scoring-, QM/MM-, and comparative binding energy (COMBINE)-type methods. RBDD requires either an X-ray crystal or NMR structure of ligands in complex with the relevant receptor. Ligand-based drug design techniques include various Quantitative Structure-Activity Relationship (QSAR) methods, which rely only on knowledge of the ligand structure. In general, QSAR can be conducted using two-dimensional (2D) or three-dimensional (3D) structures; however, the user must utilize 3D structures when using QM because of the need to have an all-atom description of the nuclei and associated electrons [6].
2.5 Ligand Based Drug Design
One of the oldest tools used in rational drug design is QSAR (Quantitative Structure Activity Relationship) [81]. QSAR models are derived for a set of compounds with dependent variables (activity values, e.g. $K_i$, $IC_{50}$), and a set of calculated molecular properties or independent variables called descriptors.
Each compound in the data set is assumed to be in its active conformation. Models are generated using statistical techniques such as Multiple Linear Regression (MLR), Principal Component Regression (PCR) [13], Partial Least Squares Regression (PLSR) [82], and Computer Neural Networks (CNNs) [83], to name a few. Ligand-based methods can be further divided into two categories, 3D-QSAR and field-based methods. Both are touched on below.
[Figure 2-5 diagram: QM + SBDD divides into Ligand-Based (Field-based, e.g. QM-QSAR; 3D-QSAR, e.g. AMPAC+CODESSA) and Receptor-Based (Scoring, e.g. QMScore; COMBINE, e.g. SE-COMBINE; QM/MM, e.g. DivCon/AMBER)]
Figure 2-5. Hierarchy of QM methods used in SBDD.
2.5.1 3D-QSAR with QM descriptors
The descriptors used in 3D-QSAR are usually divided into three categories: 1) Electronic, such as HOMO and LUMO energies, 2) Topological, for example connectivity indices, and 3) Geometric, such as moment of inertia. The models are usually created using multivariate statistical tools due to the large number and high degree of collinearity of the descriptors. An excellent review by Karelson, Lobanov, and Katritzky provides details of QM-based descriptors used in QSAR programs such as COmprehensive DEscriptors for Structural and Statistical Analysis (CODESSA) [84]. These include those that can be observed experimentally, such as dipole moments, and those that cannot, such as partial atomic charges. Clark and co-workers have recently used AM1-based descriptors to distinguish between drugs and non-drugs and to understand the relationship between descriptors and their physical properties [85]. Most descriptors are calculated at the semiempirical level of theory using programs such as AMPAC or MOPAC. However, with computer speed increasing steadily, the use of ab initio and DFT methods is becoming increasingly common. These methods allow the descriptors to be calculated from first principles.
Yang and co-workers examined various DFT-based descriptors to generate models for a series of protoporphyrinogen oxidase inhibitors. It was shown that the DFT-based model outperformed the PM3-based model [86].
2.5.2 Field-based Methods
CoMFA (Comparative Molecular Field Analysis) [10] and CoMSIA (Comparative Molecular Similarity Indices Analysis) [11, 12] are field-based or grid-based methods where all the compounds in the data set are aligned on top of one another and steric and electrostatic descriptors are calculated at each grid point using a probe atom. As a result there are many more descriptors than molecules; therefore, a Partial Least Squares (PLS) data analysis is used to generate linear equations. A study by Weaver and co-workers compares different field-based methods for QSAR, including CoMFA and CoMSIA, finding that field-based methods provide a robust tool to aid medicinal chemists [87].
Absent from the traditional MFA approaches are quantum mechanically derived descriptors of electronic structure. QMQSAR is a relatively new technique where semiempirical QM methods are used to develop quantum molecular field-based QSAR models [88]. Placing the aligned training set ligands into a finely spaced grid produces quantum molecular fields, where each ligand is characterized by a set of Probe Interaction Energy (PIE) values. A PIE is defined as the “electrostatic potential energy obtained by placing a positively charged carbon’s 2s orbital at a given grid point $g_i$ and summing the attractive and repulsive potentials experienced by that electron as it interacts with the field of the ligand L”:
\[
\begin{aligned}
PIE_i &= -\langle s_i | V(L) | s_i \rangle \\
&= -\int \chi^{*}_{s_i}(r_1)\,\chi_{s_i}(r_1) \left( \sum_{\alpha=1}^{N_{atoms}} \left[ \frac{z_\alpha}{|r_1 - r_\alpha|} - \sum_{\mu\in\alpha} \sum_{\mu'\in\alpha} P_{\mu\mu'} \int \frac{\chi^{*}_{\mu}(r_2)\,\chi_{\mu'}(r_2)}{|r_1 - r_2|}\, dr_2 \right] \right) dr_1
\end{aligned}
\tag{2–14}
\]
The nuclear charge $z_\alpha$ is simply the number of valence electrons on atom $\alpha$ and the notation $\mu \in \alpha$ indicates the set of valence atomic orbitals centered on atom $\alpha$. Density matrix elements $P_{\mu\mu'}$ are given by the following sum over the occupied MOs:
\[ P_{\mu\mu'} = 2 \sum_{k=1}^{N_{occ}} c_{\mu k}\, c_{\mu' k} \tag{2–15} \]
When applied to data sets containing corticosteroids, endothelin antagonists, and serotonin antagonists, linear regression models were produced with predictability similar to that of various CoMFA models.
2.5.3 Spectroscopic 3-D QSAR
The Spectroscopic QSAR methods [89, 90] include EVA (vibrational frequencies) [91–98], EEVA (MO energies) [99–102], and CoSA (NMR chemical shifts) [103]. It is a requirement of 3-D QSAR that all compounds being studied have the same number of descriptors. However, none of the above techniques satisfies this requirement directly. The number of vibrational frequencies ($3N - 6$, or $3N - 5$ if linear) and the number of MOs depend on the number of atoms, N, in a molecule, while the number of NMR chemical shifts depends on the number of NMR-active nuclei. A solution to this problem is to force the information onto a bound scale using a Gaussian smoothing technique, where the upper and lower limits of this scale are consistent for all compounds in the data set. A Gaussian kernel, f(x), with a standard deviation of σ is placed over each calculated point (vibrational frequency, MO energy, or NMR chemical shift) as shown in Eq. 2–16. Summing the amplitudes of the overlaid Gaussian functions at intervals x along the defined range results in the descriptors for each molecule, $\hat{f}(x)$, as shown in Eq. 2–17. This process is illustrated in figures 2-6(a) through 2-6(c).
\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left( -\frac{(x - X_A)^2}{2\sigma^2} \right) \tag{2–16} \]
\[ \hat{f}(x) = \sum_{i}^{shifts} \alpha_i \exp\left( -\beta_i (x - X_i)^2 \right) \tag{2–17} \]
[Figure 2-6 panels, each plotted against chemical shift in ppm over the range 0 to 200: (a) Calculated NMR C13 Chemical Shifts; (b) NMR C13 Chemical Shifts with Gaussian Kernels (interval = 1, sigma = 10); (c) NMR C13 Chemical Shifts with Gaussian Kernels plus the spectrum projected onto a bound scale (BNMRS), density axis (interval = 1, sigma = 10)]
Figure 2-6. NMR QSAR. Calculated NMR spectra for a steroid molecule with Gaussian kernels placed at each shift followed by a bound scale projected from the spectrum.
These descriptors contain a wealth of structural information when we consider the physical basis of the methods. IR spectroscopy provides information concerning the presence of molecular functional groups, and NMR chemical shifts are highly dependent on substituent effects in a congeneric series of compounds. MO energies describe the electronic structure of the molecule, including the HOMO and LUMO energies that play an important role in the binding process. The choice of theory used to calculate these descriptors depends on the number of compounds in the dataset and the accuracy that is required; all can be calculated using SE or ab initio methods. The QSAR results also depend on the choice of σ and x in the above equations. These methods have provided predictive models for a number of data sets and have an advantage over the field-based methods because they are alignment-free; in other words, there is no need to superimpose the structures in the dataset. Asikainen and co-workers provided a comparison of these methods in a recent paper where they studied estrogenic activity in a series of compounds [89].
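The Gaussian smoothing step described in this section can be sketched as follows. This is a minimal illustration rather than the implementation used in the cited work: the function name, the grid construction, and the example shift values are assumptions, and every kernel is given the same normalized form as Eq. 2–16.

```python
import math


def bounded_descriptor(values, lower, upper, interval, sigma):
    """Project a variable-length list of spectral values onto a fixed grid.

    A normalized Gaussian of width sigma is centered on each value and the
    kernel amplitudes are summed at every grid point between lower and upper.
    """
    n_points = int(round((upper - lower) / interval)) + 1
    norm = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    descriptor = []
    for k in range(n_points):
        x = lower + k * interval
        total = 0.0
        for v in values:
            total += norm * math.exp(-((x - v) ** 2) / (2.0 * sigma ** 2))
        descriptor.append(total)
    return descriptor


if __name__ == "__main__":
    # Two hypothetical molecules with different numbers of 13C shifts (ppm).
    mol_a = [20.0, 128.0, 170.0]
    mol_b = [15.0, 30.0, 55.0, 125.0, 140.0]
    d_a = bounded_descriptor(mol_a, 0.0, 200.0, 1.0, 10.0)
    d_b = bounded_descriptor(mol_b, 0.0, 200.0, 1.0, 10.0)
    # Both molecules now share the same number of descriptors.
    print(len(d_a), len(d_b))
```

The grid limits and spacing fix the descriptor length, so molecules with different numbers of atoms become directly comparable, while sigma controls how strongly neighboring peaks are merged.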
2.5.4 Quantum QSAR and Molecular Quantum Similarity
The Carbó group has been involved in the development of the field of quantum QSAR and molecular quantum similarity since the 1980s [104]. The Quantum Similarity Measure (QSM) between any two molecules, A and B, can be calculated using the following:
\[ Z_{AB} = \langle \rho_A | \Omega | \rho_B \rangle = \iint \rho_A(r_1)\, \Omega(r_1, r_2)\, \rho_B(r_2)\, dr_1\, dr_2 \tag{2–18} \]
where Ω is some positive definite operator (e.g. kinetic energy or Coulomb) and ρ is the electron density. The QSMs can be transformed into indices ranging between 0 and 1 using:
\[ r_{AB} = \frac{Z_{AB}}{\sqrt{Z_{AA} Z_{BB}}} \tag{2–19} \]
yielding the so-called Carbó Similarity Index (CSI). Calculating an array of QSMs or CSIs between all molecular pairs in some data set provides descriptors for Quantum QSAR (QQSAR) [105].
A drawback of the CoMFA-based methods is the need to superimpose the molecules in the training set. This is no easy task due to the many degrees of freedom (both rigid-body and internal motions). However, the alignment of the molecular structures in a common 3D framework provides a convenient method of determining which regions of the molecules impact activity and which regions can be developed to create new compounds with more favorable properties. Recently, QSMs have been used with a Lamarckian genetic algorithm called the Quantum Similarity Superposition Algorithm (QSSA) to superimpose the classic CoMFA data set [106]. The QSSA is performed in such a way as to maximize the molecular similarity and does not rely on atom typing as other empirically based methods do.
Popelier and co-workers have coupled the Atoms-In-Molecules (AIM) theory of Bader with quantum molecular similarity to produce Quantum Topological Molecular Similarity (QTMS) [107]. It uses the so-called Bond Critical Points (BCPs) of predefined bonds in a series of molecules as descriptors, followed by multivariate statistical analysis. The series of compounds must have a common core for this method to remain computationally tractable.
QTMS has been used to generate models to estimate the $pK_a$ values for a set of aliphatic carboxylic acids, anilines, and phenols [108].
2.6 Receptor Based Drug Design
The classic “Pac-Man” representation of Receptor-Ligand binding is shown in Fig. 2-7, where the receptor is depicted on the left subdivided into residues and on the right is a small molecule split into fragments. Proteins can be split using standard amino acid definitions, while ligand structures can be decomposed using functional group definitions. The binding free energies between receptors and ligands can be calculated using classical and quantum mechanical methods. In most cases when QM is used in SBDD a single snapshot of this complex is taken and the interaction energy is determined. Taking ensemble averages is expensive and time consuming.
Figure 2-7. The Classic “Pac-Man” Representation of Receptor-Ligand Binding. The receptor is depicted on the left subdivided into residues and on the right is a small molecule split into fragments.
Figure 2-8. PWD Density Matrix Representation
The scheme in Fig. 2-8 is a matrix or graphical comparison between classical and quantum mechanical methods in SBDD. This scheme is divided into three parts. First, on the left is a large box which represents a receptor made up of smaller boxes or residues, I, such as amino acids or bases. The dark blue box represents how all the other residues in the receptor polarize that residue. Polarization is where the charges centered on each atom are allowed to relax in the field of all other charges. The lighter blue box symbolizes the charge transfer that can occur between residues in a receptor. The smaller box in the middle of the figure symbolizes a ligand and the smaller boxes it contains are molecular fragments, J. The pink and yellow boxes can be described in a similar fashion to the boxes of the receptor. The largest box on the right is the complex structure.
Upon binding, both the residues, I, and the fragments, J, are allowed to relax in the presence of each other. The I residues are transformed from dark blue to mustard while the J fragments are changed from pink to brown. This is polarization; however, now it is caused by complex formation. Most classical potentials cannot model these effects; however, recently there have been some attempts to incorporate polarization into classical methods such as ff02 and AMOEBA [109]. Conversely, QM methods include these interactions implicitly. The I-K (blue to grey) and J-L (yellow to magenta) interactions originate from the effect binding has on the intramolecular interactions of the receptor or ligand. These interactions can include charge transfer and polarization. The I-J interactions (red box) are the most important. Both methods can calculate these and they are only present in the complex structure. Classical potentials describe the Coulombic and van der Waals interactions, or electrostatic and dispersive effects, between the moieties. The QM methods go a step further to include the other higher order effects such as polarization and charge transfer. This is where the QM methods begin to describe the physics of the system more completely; however, this does not necessarily suggest that they are more predictive!
2.7 Semiempirical Divide-And-Conquer Approach
Very few full quantum mechanical studies of whole proteins have been published [7, 9, 110], but with the increasing speed of computers and linear-scaling Divide-and-Conquer (D&C) techniques it is now possible to include the whole protein [111–113]. The D&C method takes advantage of the local character of chemical interactions, which causes the magnitude of density matrix elements to decrease exponentially with distance. Through the use of cutoffs for the Fock and density matrices and D&C techniques, the “nearsightedness” of chemical interactions can be exploited without loss of accuracy [114].
The D&C method divides the molecular system into overlapping subsystems where each localized Roothaan-Hall equation can be solved separately:
\[ F^{\alpha} C^{\alpha} = C^{\alpha} E^{\alpha} \tag{2–20} \]
where $F^{\alpha}$, $C^{\alpha}$, and $E^{\alpha}$ are the subsystem Fock, coefficient, and orbital energy matrices. The overlap matrix, S, in SE methods is set equal to the identity matrix due to the NDDO (Neglect of Diatomic Differential Overlap) approximation:
\[ \langle \mu_A \nu_B | \lambda_C \sigma_D \rangle = \langle \mu_A \nu_A | \lambda_C \sigma_C \rangle\, \delta_{AB}\, \delta_{CD} \tag{2–21} \]
where $\delta_{AB}$ is the Kronecker delta function:
\[ \delta_{AB} = \begin{cases} 1 & \text{if } A = B, \\ 0 & \text{otherwise.} \end{cases} \tag{2–22} \]
The diagonalization of the global Fock matrix is the most expensive part of a standard SE calculation, whereas two-electron integral evaluation is the bottleneck of ab initio methods. However, subdividing the global Fock matrix in the D&C method replaces global diagonalization with subsystem diagonalizations, which scale linearly with the number of subsystems, as $n_{sub}(N^{\alpha})^{3}$. The subsystem density matrices are used to assemble the global density matrix and the total energy is calculated using Equation 2–11.
The subsetting scheme in D&C methods is the key to its efficiency. Usually, each subsystem comprises a core region surrounded by one or more buffer regions. In protein systems, it has been shown that treating each amino acid as a core with a 4.5 Å/2.0 Å buffering scheme offers a good compromise between computational efficiency and accuracy. The D&C method is not, however, the only linear-scaling SE method; other methods include density matrix minimization [115] and the localized molecular orbital method [116].
Recently, Raha and Merz reported an SE D&C-based scoring function, QMSCORE [117, 118], which is capable of predicting the binding free energy of protein-ligand complexes.
QMSCORE is derived using current technologies to best describe the master equation, Eq. 2–7:
\[ \Delta G_{bind} = \Delta H_{bind}^{gas} + \Delta LJ_6 + \Delta S_{solv} + \Delta S_{conf} + \Delta\Delta G_{solv} \tag{2–23} \]
The enthalpic interactions in the gas phase, $\Delta H_{bind}^{gas}$, between the protein and ligand were determined using semiempirical Hamiltonians such as AM1 and PM3. The attractive part of the Lennard-Jones potential, $\Delta LJ_6$, was used to represent the dispersive interactions neglected by SE methods. The solvent entropy, $\Delta S_{solv}$, and conformational entropy, $\Delta S_{conf}$, were accounted for by the solvent accessible surface area (SASA) and the number of rotatable bonds. The solvation free energy due to complexation, $\Delta\Delta G_{solv}$, was calculated using a Poisson-Boltzmann continuum approach. QMSCORE was applied to 165 protein-ligand complexes including HIV protease, serine protease, FKBP, and DHFR. Although there was a substantial increase in computational cost, it showed better performance than other scoring functions such as AutoDock, DrugScore, and LigScore.
2.8 Pairwise Energy Decomposition (PWD)
QM methods are frequently used to determine the electronic energy of molecular systems. Electronic energies are quantities that characterize the whole system and do not provide any information regarding the key interactions taking place. Unlike an MM force field, QM does not easily lend itself to descriptions of energetics in a pairwise fashion. However, work first done by Fischer and Kollmar using a modified CNDO (Complete Neglect of Differential Overlap) method partitioned the energy into monocentric, $E_A$, and bicentric, $E_{AB}$, terms [119].
\[ E_{TOT} = \sum_{A=1}^{N} E_A + \sum_{A=1}^{N} \sum_{B>A}^{N} E_{AB} \tag{2–24} \]
\[ \phantom{E_{TOT}} = \sum_{A} \left[ E_A + \sum_{B>A} \left( E_{AB} + E'_{AB} + E_{AB}^{core} \right) \right] \tag{2–25} \]
[...] interactions between human Carbonic Anhydrase II and a series of fluorine-substituted ligands [7]. Similar to the decomposition by Fischer and Kollmar, the total energy can be calculated by summing the mono- and bicentric terms as shown in Equation 2–25.
The bicentric term is comprised of a repulsive term, $E_{AB}$, an exchange term, $E'_{AB}$, and a core-core repulsion term, $E_{AB}^{core}$ (Eq. 2–26).