
AN INTERACTING QUANTUM ATOMS APPROACH TO CONSTRUCTING A CONFORMATIONALLY DEPENDENT BIOMOLECULAR FORCE FIELD BY GAUSSIAN PROCESS REGRESSION: POTENTIAL ENERGY SURFACE SAMPLING AND VALIDATION

A thesis submitted to the University of Manchester for the degree of Doctor of Philosophy in the Faculty of Engineering and Physical Sciences

2016

Salvatore Cardamone

School of Chemistry

Contents

Abstract

Declaration

Copyright Statement

Acronyms and Abbreviations

Acronyms

Acknowledgements

1 Introduction

2 Exposition

2.1 Molecular Mechanics

2.1.1

2.1.2 Classical Force Fields

2.1.3 ab initio Molecular Dynamics

2.2 Atomic Partitioning

2.2.1 Hirshfeld Partitioning

2.2.2 Bader’s Atoms in Molecules

2.2.3 Interacting Quantum Atoms

2.3 Multipole Moment Electrostatics

2.3.1 Mathematical Details

2.3.2 Implementation

2.4 Machine Learning

2.4.1 Overview

2.4.2 Kriging

2.4.3 Particle Swarm Optimisation

2.5 The Quantum Chemical Topological Force Field

2.5.1 Previous Work

2.5.2 Proposed Implementation

3 Conformational Sampling

3.1 Introduction

3.2 Stationary Point Vibrations

3.2.1 Kinetic Energy

3.2.2 Potential Energy

3.2.3 Equations of Motion

3.3 Generalised Vibrations

3.4 Tyche - Conformational Sampling Software

3.4.1 Transformation to Normal Coordinates

3.4.2 Dynamics

3.4.3 Markov Chains on Tessellated PESs

3.4.4 Algorithm

3.5 Redundant Internal Coordinates

3.5.1 Overview

3.5.2 Valence Coordinates

3.5.3 Implementation

3.6 Results

3.6.1 Validation and Benchmarking

3.6.2 Markov Chain Conformational Sampling

3.6.3 Conformational Sampling of Large Molecular Species

3.7 Conclusion

4 A Novel Carbohydrate Force Field

4.1 Introduction

4.2 A Basis Set for Machine Learning

4.3 Computational Details

4.4 Results and Discussion

4.4.1 Single Minimum

4.4.2 Training Set Size Dependency

4.4.3 Multiple Minima

4.5 Conclusion

5 Subset Selection for Machine Learning

5.1 Introduction

5.2 Greedy Heuristics

5.2.1 Experimental Details

5.2.2 MaxMin

5.2.3 Deletion

5.2.4 Alternative Distance Metric

5.3 Sequential Selection

5.4 Iterative Voronoi Subset Selection

5.5 Domain of Applicability

5.5.1 Introduction

5.5.2 The Convex Hull

5.5.3 Test Case

5.5.4 Largest d-Polytope

5.6 Conclusion

6 Raman Optical Activity

6.1 General Theory

6.1.1 Classical Raman Scattering

6.1.2 The Electromagnetic Field

6.1.3 Hamiltonian in an Electromagnetic Field

6.1.4 Wavefunction Perturbed by Periodic Field

6.1.5 The Raman Optical Activity Tensors

6.1.6 ROA Intensities

6.2 Previous Computational Work

6.2.1 Algorithmic Developments

6.2.2 Solvation Effects

6.2.3 Quantitative Measures of Similarity

6.2.4 Large Biomolecular Systems

6.3 Technical Details

6.3.1 Zwitterionic Histidine

6.3.2 Computing Spectra from a Number of Conformers

6.3.3 Filtering of Similar Geometries

6.4 Computational Details

6.4.1 Molecular Dynamics

6.4.2 Geometric Filtering

6.4.3 Calculation of Spectra

6.5 Conformer Optimisation

6.5.1 Boltzmann Weighting

6.5.2 Effects of Level of Optimisation

6.5.3 Alternative Optimisation Schemes

6.6 Microsolvation

6.6.1 Neutral Histidine

6.6.2 Protonated Histidine

6.7 Conclusion

7 Conclusion and Future Work

7.1 General Conclusions

7.2 Future Work

Bibliography

A Density Functional Theory

A.1 Thomas-Fermi Model

A.2 Hohenberg-Kohn Theorems

A.2.1 First Theorem

A.2.2 Second Theorem

A.3 Kohn-Sham Equations

B Storage of Kriging Models

List of Tables

2.1 The number of geometric minima predicted by a variety of force fields for serine and cysteine. Also given are the number of minima that the various force fields predicted but that were not represented in the set of minima generated by ab initio calculations, and the mean average deviations (MAD) for the molecular energies at each geometry.

2.2 Free energies of hydration for a variety of molecules predicted by use of AMOEBA and TIP3P-like water potentials. Corresponding experimental values are also given. Energies are in kcal mol^-1.

3.1 Memory requirements for tyche.

3.2 Conformational sampling of water at a number of values of nperiod.

3.3 Conformational sampling of NMA at a number of values of nperiod.

3.4 Conformational sampling of histidine at a number of values of nperiod.

3.5 Conformational sampling of water using the Jacobian and Hessian from a number of different levels of theory.

3.6 Conformational sampling of NMA using the Jacobian and Hessian from a number of different levels of theory.

3.7 Conformational sampling of histidine using the Jacobian and Hessian from a number of different levels of theory.

3.8 Sampling ranges of four prominent bond stretching degrees of freedom in zwitterionic alanine as the sampling temperature is increased. The ab initio Hessian and Jacobian are taken from gas phase frequency calculations. In the bottom row, we include the ranges of these degrees of freedom as obtained from a 300 K in vacuo MD trajectory. In red, we have refined these sampling ranges down so that they represent the sampling ranges taken by over 98% of conformers over the course of the MD trajectory.

3.9 Sampling ranges of four prominent valence angle bending degrees of freedom in zwitterionic alanine as the sampling temperature is increased. The ab initio Hessian and Jacobian are taken from gas phase frequency calculations. In the bottom row, we include the ranges of these degrees of freedom as obtained from a 300 K in vacuo MD trajectory. In red, we have refined these sampling ranges down so that they represent the sampling ranges taken by over 98% of conformers over the course of the MD trajectory.

3.10 Sampling ranges of four prominent bond stretching degrees of freedom in zwitterionic alanine as the sampling temperature is increased. The ab initio Hessian and Jacobian are taken from implicitly solvated (CPCM) frequency calculations. In the bottom row, we include the ranges of these degrees of freedom as obtained from a 300 K TIP3P-solvated MD trajectory. In red, we have refined these sampling ranges down so that they represent the sampling ranges taken by over 98% of conformers over the course of the MD trajectory.

3.11 Sampling ranges of four prominent valence angle bending degrees of freedom in zwitterionic alanine as the sampling temperature is increased. The ab initio Hessian and Jacobian are taken from implicitly solvated (CPCM) frequency calculations. In the bottom row, we include the ranges of these degrees of freedom as obtained from a 300 K TIP3P-solvated MD trajectory. In red, we have refined these sampling ranges down so that they represent the sampling ranges taken by over 98% of conformers over the course of the MD trajectory.

3.12 Sampling ranges of four prominent bond stretching degrees of freedom in zwitterionic alanine, as obtained through tyche with a single seeding geometry (tyche (Single)), with five seeding geometries (tyche (Multi)) and through a MD trajectory.

3.13 Sampling ranges of four prominent valence angle bending degrees of freedom in zwitterionic alanine, as obtained through tyche with a single seeding geometry (tyche (Single)), with five seeding geometries (tyche (Multi)) and through a MD trajectory.

4.1 Mean errors associated with the CDEs given in Figure 4.3.

4.2 Mean errors corresponding to the CDEs depicted in Figure 4.6. Note the ‘bunching’ of prediction error when 60 energetic minima or higher are used as seeds for the conformational sampling.

5.1 Maximum prediction errors for each small molecule kriging model as the training set size is increased. Maximum prediction errors for the randomly constructed training sets are presented in the final row.

5.2 Comparison of the greedy MaxMin and greedy Deletion algorithms on a candidate set in R^2, with the aim of constructing a subset, where |S| = 4. The first row, corresponding to the steps taken by the MaxMin algorithm, adds two points (red) at each iteration to the subset, whose distance is the largest in the entire candidate set. The second row corresponds to the steps taken by the Deletion algorithm. Here, the two points in the candidate set which are closest together are selected (green). Their nearest neighbours (orange) are then evaluated, and the point with the closest nearest neighbour is removed from the candidate set.

5.3 Maximum prediction errors for each small molecule kriging model as the training set size is increased. Maximum prediction errors for the randomly constructed training sets are presented in the final row.

5.4 MaxMin-constructed training set maximum prediction errors for each small molecule kriging model as the training set size is increased. Maximum prediction errors for the randomly constructed training sets are presented in the final row.

5.5 Deletion-constructed training set maximum prediction errors for each small molecule kriging model as the training set size is increased. Maximum prediction errors for the randomly constructed training sets are presented in the final row.

5.6 Average errors associated with the SS1, SS10 and SS1040 kriging models relative to the random benchmark.

5.7 Maximum errors associated with the SS1, SS10 and SS1040 kriging models relative to the random benchmark.

5.8 Standard deviation of errors associated with the SS1, SS10 and SS1040 kriging models relative to the random benchmark.

6.1 Conformer Boltzmann weights exceeding 1% for the 35 microsolvated histidine systems using the system-weighting and solute-weighting schemes.

List of Figures

1.1 Simulation techniques available to the computational chemist.

2.1 A contour plot of the electron density in the molecular plane of furan superimposed onto a representative collection of gradient paths. Atoms are represented by black circles, where the gradient paths terminate. Interatomic surfaces are highlighted as solid curves, and contain bond critical points (black squares). A ring critical point (triangle) is also shown in the centre of the furan.

2.2 Vectors r and r′ in a Cartesian axis system, along with their colatitudinal and azimuthal degrees of freedom, (Θ, Φ) and (θ, φ), respectively.

2.3 Vector quantities required for an expansion, in multipole moments, of the interaction between the charge distributions centred at R_A and R_B.

2.4 The overlap monopole (upper graph) and dipole (lower graph) moments at various overlap centres along the hydrogen fluoride internuclear axis. The nuclear charges have been added to the atomic centres.

3.1 Distribution of molecular energies obtained through tyche by use of the energy allocation scheme of (3.4.12). The lowest energy point corresponds to the ab initio minimum.

3.2 Distribution of molecular energies obtained through tyche by use of the energy allocation scheme of (3.4.13). The lowest energy point corresponds to the ab initio minimum.

3.3 Distribution of molecular energies obtained through tyche by use of the energy allocation scheme of (3.4.14). The lowest energy point corresponds to the ab initio minimum.

3.4 One-dimensional analytical PES (blue) and the tyche-generated PES (red), recreated through the seeding geometries A-G. Stochastic dynamics are performed upon this tessellation of LWs in an attempt to generate a sample set that would be representative of continuous dynamics on the ab initio PES.

3.5 The conventionally invoked valence coordinates. (a) corresponds to bond stretching, (b) corresponds to valence angle bending, and (c) corresponds to dihedral torsion. The major components that will be required in the following discussion are shown.

3.6 tyche output outlining the normal mode information of water. Normal modes 1, 2 and 3 correspond to the valence angle bending, symmetric stretching and asymmetric stretching modes, respectively. The RIC contributions to each normal mode are given at the bottom of the figure.

3.7 Range of CH bond stretch degrees of freedom in methane (green), ethane (red), neopentane (yellow) and cyclopentane (blue) as the temperature is increased. The black lines correspond to the range of sampling undertaken in Maple et al.

3.8 Range of HCH valence angle degrees of freedom in methane (green), ethane (red), neopentane (yellow) and cyclopentane (blue) as the temperature is increased. The black lines correspond to the range of sampling undertaken in Maple et al.

3.9 Range of CN bond stretch degrees of freedom in formamide (green), N-methylacetamide (red), methylazocyclopropanone (yellow) and N,N-dimethylformamide (blue) as the temperature is increased. The black lines correspond to the range of sampling undertaken in Maple et al.

3.10 Range of OCN valence angle degrees of freedom in formamide (green), N-methylacetamide (red), methylazocyclopropanone (yellow) and N,N-dimethylformamide (blue) as the temperature is increased. The black lines correspond to the range of sampling undertaken in Maple et al.

3.11 Alanine zwitterion atomic labelling.

3.12 Histograms for the sampling of the alanine sidechain dihedral angle by use of the various tyche sampling schemes (by seeding with energetic minima (red), MD snapshot conformations (green) and the selective mode distortion (green)), and the MD sampling (blue).

3.13 Helix stretching and bending modes of motion for the alanine-isoleucine helix.

3.14 Helix breathing motion recovered from selectively evolving those modes comprising dihedral (red), valence angle (green) and bond stretching (blue) degrees of freedom.

3.15 Helix bending motion recovered from selectively evolving those modes comprising dihedral (red), valence angle (green) and bond stretching (blue) degrees of freedom.

4.1 The Atomic Local Frame for the carbon of methanol. From the global coordinate system, x, y, z, one is able to define an ALF, the basis vectors of which are denoted x_ALF, y_ALF, z_ALF, from the vectors R_Ωo, R_Ωx, R_Ωxy. Any atom not defining the ALF, e.g. R_n, is defined in the ALF with the spherical polar coordinates r_n, θ_n, φ_n.

4.2 Erythrose and Threose in the open chain and furanose forms.

4.3 CDEs corresponding to all erythrose (top) and threose (bottom) systems studied. The open chain forms are systematically better predicted than the corresponding furanose forms.

4.4 CDEs for erythrose open chain at various training set sizes. Note the progression of the CDEs towards the lower prediction errors as the training set size increases. However, owing to the logarithmic abscissa, this does not correspond to a uniform enhancement of a kriging model given a consistently larger training set size.

4.5 Mean prediction errors associated with CDEs for erythrose open chain as the training set size is increased. With the old kriging engine (left), a distinct plateau formed after roughly 1200 training examples, corresponding to no further improvement in the kriging model despite additional training points. However, the new kriging engine (right) appears to avoid premature plateauing, with additional kriging model improvement at higher training set sizes. Regression fits of the ferebus errors against training set size, of functional forms A + B/n (blue) and C + D/√n (black), where n is the training set size, are also given.

4.6 S-curves depicting the power of a kriging model as more energetic minima are utilised for conformational sampling. As the number of minima utilised increases, the CDEs tend towards higher prediction errors. The kriging models which underlie these CDEs have a fixed training set size of 700.

5.1 Average molecular energy error, with accompanying standard deviations, in the testing set conformations given a training set constructed by the MaxMin algorithm from the respective candidate sets at a number of training set sizes. In each pane, the dark black line corresponds to the average molecular energy error in the testing set given a training set of size 1000, generated randomly from the candidate set.

5.2 Average molecular energy error, with accompanying standard deviations, in the testing set conformations given a training set constructed by the Deletion algorithm from the respective candidate sets at a number of training set sizes. In each pane, the dark black line corresponds to the average molecular energy error in the testing set given a training set of size 1000, generated randomly from the candidate set.

5.3 Average molecular energy error, with accompanying standard deviations, in the testing set conformations given a training set constructed by the MaxMin algorithm utilising the kriging covariance as a distance metric. In each pane, the dark black line corresponds to the average molecular energy error in the testing set given a training set of size 1000, generated randomly from the candidate set.

5.4 Average molecular energy error, with accompanying standard deviations, in the testing set conformations given a training set constructed by the Deletion algorithm utilising the kriging covariance as a distance metric. In each pane, the dark black line corresponds to the average molecular energy error in the testing set given a training set of size 1000, generated randomly from the candidate set.

5.5 Cumulative distribution functions of errors associated with the sequential selection kriging models. SS1 (green), SS10 (blue) and SS1040 (yellow) are compared to a randomly constructed training set benchmark (red). Whilst the SS1 model performs similarly to the random benchmark, the SS10 and SS1040 both considerably outperform the random benchmark.

5.6 Voronoi tessellation of 12 points in R^2, denoted by black points. The Voronoi cell corresponding to each point is coloured. Voronoi cells are partitioned from one another by lines. Where three or more Voronoi cells meet, a vertex is formed. Image generated by the interactive Voronoi diagram generator, located at http://www.alexbeutel.com/webgl/voronoi.html.

5.7 Cumulative distribution function of errors for a training set constructed by the Iterative Voronoi Subset Selection (blue) and a randomly constructed training set (red).

5.8 The Ackley function in two dimensions. The function is projected down onto the xy-plane as a colour-contour plot.

5.9 Absolute error (left) and MSE (right) of predictions made from within the convex hull (yellow) and outside of the convex hull (blue) of the training set.

5.10 Progression of average error ratio (left abscissa) and Δ (right abscissa) with increasing acceptance percentage for the water monomer (R^3).

5.11 Definition of three vertices used in the construction of A_S in R^3.

5.12 Progression of average error ratio (left abscissa) and Δ (right abscissa) with increasing acceptance percentage for the methanol (R^12).

5.13 Progression of average error ratio (left abscissa) and Δ (right abscissa) with increasing acceptance percentage for the NMA (R^30).

5.14 Progression of average error ratio (left abscissa) and Δ (right abscissa) with increasing acceptance percentage for the water decamer (R^84).

6.1 Zwitterionic histidine (His^0[NτH]) where the imidazole nitrogen atoms have been labelled as indicated in the text. The two sidechain dihedral angles, χ_1, χ_2, are also labelled.

6.2 Zwitterionic histidine protonation states and tautomers.

6.3 Fractional contribution of the various histidine charge states as a function of pH. Blue, green, red and yellow curves correspond to the fractional contributions of the His^2+, His^1+, His^0 and His^1- states, respectively. We have limited the pH range between 0 and 12, and so omit the fractional contribution of the His^2- state.

6.4 Sidechain dihedral angles (χ_1, χ_2) taken by conformers within the three histidine ensembles obtained by a Kabsch filtering of the MD trajectories; His^1+ (blue circles), His^0[NπH] (green diamonds) and His^0[NτH] (red triangles). Grey hatched regions correspond to those conformers used in the work of Deplazes et al.

6.5 Raman spectra of the different Boltzmann weighting schemes. The top pane shows the Raman spectrum resulting from the system-weighting and the middle pane from the solute-weighting. The bottom pane is the experimental Raman spectrum of zwitterionic histidine.

6.6 ROA spectra of the different Boltzmann weighting schemes. The top pane shows the ROA spectrum resulting from system-weighting and the middle pane from the solute-weighting. The bottom pane is the experimental ROA spectrum of zwitterionic histidine.

6.7 Raman spectra of zwitterionic histidine computed from a number of levels of geometrical optimisation. The top pane contains the Raman spectrum computed directly from unoptimised MD snapshots, while the middle two contain Raman spectra computed from conformers which have been optimised with the “Loose” and “Regular” convergence criteria. The bottom panel contains the experimental Raman spectrum of histidine.

6.8 ROA spectra of zwitterionic histidine computed from a number of levels of geometrical optimisation. The top pane contains the ROA spectrum computed directly from unoptimised MD snapshots, while the middle two contain ROA spectra computed from conformers which have been optimised with the “Loose” and “Regular” convergence criteria. The bottom panel contains the experimental ROA spectrum of histidine.

6.9 Raman spectra resulting from the optimisation methodologies we have outlined. Enumeration corresponds to the labels we have used in the main text.

6.10 ROA spectra resulting from the optimisation methodologies we have outlined. Enumeration corresponds to the labels we have used in the main text.

6.11 Raman spectra of microsolvated histidines. From the top pane to the bottom pane, we show histidine microsolvated by 5, 10, 15 and 20 explicit water molecules. The final pane contains the experimental histidine Raman spectrum.

6.12 ROA spectra of microsolvated histidines. From the top pane to the bottom pane, we show histidine microsolvated by 5, 10, 15 and 20 explicit water molecules. The final pane contains the experimental histidine ROA spectrum.

6.13 Raman spectra of microsolvated cationic histidine. From the top pane to the bottom pane, we show cationic histidine microsolvated by 5, 10, 15 and 20 explicit water molecules. The final pane contains the experimental histidine Raman spectrum.

6.14 ROA spectra of microsolvated cationic histidine. From the top pane to the bottom pane, we show cationic histidine microsolvated by 5, 10, 15 and 20 explicit water molecules. The final pane contains the experimental histidine ROA spectrum.

Abstract

The energetics of chemical systems are quantum mechanical in origin and dependent upon the internal molecular conformational degrees of freedom. “Classical force field” strategies are inadequate approximations to these energetics owing to a plethora of simplifications, both conceptual and mathematical. These simplifications have been employed to make the in silico modelling of molecular systems computationally tractable, but are also subject to both qualitative and quantitative errors. In spite of these shortcomings, classical force fields have become entrenched as a cornerstone of computational chemistry.

The Quantum Chemical Topological Force Field (QCTFF) has been a central research theme within our group for a number of years, and has been designed to ameliorate the shortcomings of classical force fields. Within its framework, one can undertake a full spatial decomposition of a chemical system into a set of finite atoms. Atomic properties are subsequently obtained by a rigorous quantum mechanical treatment of the resultant atomic domains through the theory of Interacting Quantum Atoms (IQA). Conformational dependence is accounted for in the QCTFF by use of Gaussian Process Regression, a machine learning technique. In so doing, one constructs an analytical function to provide a mapping from a molecular conformation to a set of atomic energetic quantities. One can subsequently conduct dynamics with these energetic quantities.

The notion of “conformational sampling” is shown to be of key importance to the proper construction of the QCTFF. Conformational sampling is a key theme in this work, and a subject upon which we will expatiate. We suggest a novel conformational sampling scheme, and attempt a number of conformer subset selection strategies to construct optimal machine learning models. The QCTFF is then applied to carbohydrates for the first time, and shown to produce results well within the commonly invoked threshold of “chemical accuracy”, O(β^-1), where β is the thermodynamic beta. Finally, we present a number of methodological developments to aid in both the accuracy and tractability of predicting ab initio vibrational spectroscopies.

Declaration

No portion of the work referred to in the thesis has been submitted in support of an application for another degree or qualification of this or any other university or other institute of learning.

Several publications from the work outlined in this thesis are included in the addendum. They are listed here for convenience:

Cardamone, S. et al., Phys. Chem. Chem. Phys., 16:10367-10387, 2014

Cardamone, S. and Popelier, P.L.A., J. Comp. Chem., 36(32):2361-2373, 2015

Hughes, T.J. et al., J. Comp. Chem., 36(24):1844-1857, 2015

Maxwell, P. et al., Theor. Chem. Acc., 135(8):195-210, 2016

Cardamone, S. et al., Phys. Chem. Chem. Phys., 18(39):27377-27389, 2016

Those publications in which I am the first author have been used to construct this thesis. Only those portions that I have written have been reproduced.

This thesis does not exceed 80,000 words.

Salvatore Cardamone, November 22, 2016.

Copyright Statement

i. The author of this thesis (including any appendices and/or schedules to this thesis) owns certain copyright or related rights in it (the “Copyright”) and s/he has given The University of Manchester certain rights to use such Copyright, including for administrative purposes.

ii. Copies of this thesis, either in full or in extracts and whether in hard or electronic copy, may be made only in accordance with the Copyright, Designs and Patents Act 1988 (as amended) and regulations issued under it or, where appropriate, in accordance with licensing agreements which the University has from time to time. This page must form part of any such copies made.

iii. The ownership of certain Copyright, patents, designs, trade marks and other intellectual property (the “Intellectual Property”) and any reproductions of copyright works in the thesis, for example graphs and tables (“Reproductions”), which may be described in this thesis, may not be owned by the author and may be owned by third parties. Such Intellectual Property and Reproductions cannot and must not be made available for use without the prior written permission of the owner(s) of the relevant Intellectual Property and/or Reproductions.

iv. Further information on the conditions under which disclosure, publication and commercialisation of this thesis, the Copyright and any Intellectual Property and/or Reproductions described in it may take place is available in the University IP Policy, in any relevant Thesis restriction declarations deposited in the University Library, The University Library’s regulations and in The University’s Policy on Presentation of Theses.

Acronyms and Abbreviations

ADMP Atom Centred Density Matrix Propagation

AIM Atoms In Molecules

ALF Atomic Local Frame

AMBER Assisted Model Building with Energy Refinement

AMOEBA Atomic Multipole Optimized Energetics for Biomolecular Applications

ANN Artificial Neural Network

aug-cc-pVDZ Augmented Correlation Consistent Polarisable Valence Double Zeta

B3LYP Becke 3-Lee Yang Parr

CCSD(T) Coupled Cluster Singles and Doubles (with Perturbative Triples)

CCT Cartesian Coordinate Tensor Transfer

CDE Cumulative Distribution of Errors

CHARMM Chemistry at Harvard Macromolecular Mechanics

CHELPG Charges from Electrostatic Potentials using a Grid-Based Method

COSMO Conductor-Like Screening Model

CSD Cambridge Structural Database

DoA Domain of Applicability

DFT Density Functional Theory

DMA Distributed Multipole Analysis

GROMACS Groningen Machine for Chemical Simulations

GPR Gaussian Process Regression

HF Hartree Fock

IAS Interatomic Surface

IQA Interacting Quantum Atoms

IVSS Iterative Voronoi Subset Selection

LdP Largest d-Polytope

LW Local Well

M06 Minnesota Functional 06

MD Molecular Dynamics

MEP Molecular Electrostatic Potential

MPn Møller-Plesset nth order Perturbation Theory

mRMSD Minimised Root Mean Squared Deviation

NMA N-methyl Acetamide

OPLS-AA Optimized Potentials for Liquid Simulations All Atoms

p-RDM pth Order Reduced Density Matrix

PCM Polarisable Continuum Model

PIMD Path Integral Molecular Dynamics

PES Potential Energy Surface

PSO Particle Swarm Optimisation

QCT Quantum Chemical Topology

QCTFF Quantum Chemical Topological Force Field

RAH-LdP Restricted Affine Hull of the Largest d-Polytope

RDF Radial Distribution Function

RESPA Reversible Reference System Propagator Algorithm

RIC Redundant Internal Coordinate

ROA Raman Optical Activity

RMSD Root Mean Squared Deviation

SS Sequential Selection

SVD Singular Value Decomposition

TAE Transferable Atom Equivalents

TDSCF Time-Dependent Self-Consistent Field

TDSE Time-Dependent Schrödinger Equation

Acknowledgements

From commencing my university studies in mechanical engineering, quickly switching to biochemistry, undertaking a PhD in theoretical chemistry, and now transitioning to physics for a postdoctoral position, my academic trajectory has been rather unconventional.

I can probably say, with some degree of certainty, that if it weren’t for my parents’ continual support throughout my periods of indecision, I would have been in no position to submit a thesis. I can’t possibly express the full extent of my gratitude in prose. The greatest accolade I can strive for is to consider myself half as accomplished as they are. They have made this work possible, and so it is dedicated to them in its entirety.

I am indebted to Professor P L A Popelier for his guidance and direction over the past four years. I can appreciate the risk associated with taking me on as a PhD candidate given my previous propensity for switching disciplines. I can only hope that my contributions to his group’s work go some way to repaying his efforts.

I would like to thank Dr E W Blanch, and all the members of the Popelier group, past and present, with whom I have worked over the course of my research. Thanks are also extended to all administrative and technical staff at The University of Manchester with whom I have liaised.

Finally, I would like to thank the BBSRC for a fully-funded DTP studentship.

“... Chemistry is that branch of natural philosophy in which the greatest improvements have been and may be made: it is on that account that I have made it my peculiar study; but at the same time I have not neglected the other branches of science. A man would make but a very sorry chemist if he attended to that department of human knowledge alone. If your wish is to become really a man of science, and not merely a petty experimentalist, I should advise you to apply to every branch of natural philosophy, including mathematics.”

Frankenstein, Mary Shelley

Chapter 1

Introduction

There are a number of commonly echoed sentiments that enunciate the difficulty in applying quantum mechanics to chemical systems. The statements by Paul Dirac[1] and Walter Kohn[2] are perhaps two of the most overused. To keep this work as uncontrived as possible, the reader is spared from having to read them once more. One can distil their observations down to the fact that the quantum mechanical state vector, the wavefunction, is a complex, high-dimensional object, and the cost of representing it grows exponentially with the number of particles. Indeed, it has even been suggested that the Schrödinger equation cannot be solved accurately when the number of particles is greater than ten, by virtue of this “curse of dimensionality”[3].

However, by the above reasoning, one is immediately met with a contradiction, in that there exist a plethora of methods for calculating with, or manipulating, wavefunctions for systems of many particles. It can be argued that the complexity alluded to by Dirac and Kohn is illusory. Consider two completely different macroscopic wavefunctions of a system, i.e. they occupy completely different regions of the system’s Hilbert space. Both wavefunctions satisfy the Schrödinger equation for the system. Then, it is formally valid to write a superposition of these two wavefunctions that satisfies the Schrödinger equation. Of course, this line of reasoning is equivalent to the famous Schrödinger cat gedanken experiment.

However, for the vast majority of physically realisable (low energy) states of the system, nature is not required to explore the entire Hilbert space. As such, superpositions of completely different wavefunctions are irrelevant to an analysis of the underlying physics of the system. Put more concisely, there are no Schrödinger cats in [4]. One is then able to confine their analysis to small regions of the quantum Hilbert space of the system, and consequently reduce the complexity of the mathematics alluded to by Dirac and Kohn. In so doing, a number of computationally feasible schemes arise for undertaking quantum mechanical calculations on chemical systems.

Figure 1.1: Simulation techniques available to the computational chemist.

Broadly speaking, the computationally tractable methodologies for analysing a system can be categorised based on their accuracy and computational feasibility, as depicted in Figure 1.1. Those techniques towards the left-hand side of the diagram are typically cheap to evaluate, but also very approximate in nature. Moving further to the right, the techniques become more accurate, but also more expensive to evaluate. Those techniques at the far right of the diagram are only viable options for systems containing a very small number of atoms. The computational chemist is then required to select an appropriate technique based on the desired accuracy of a calculation and the computational facilities that are available to them.

Biochemical systems are very difficult to treat with the lower accuracy methodologies of Figure 1.1. Biochemical systems are subject to a wealth of chemical interactions: hydrogen bonding, halogen bonding, entropic effects, interactions with heavy elements requiring a relativistic treatment, reactions, etc. In addition, biochemical systems can rarely be treated as small, closed thermodynamic systems; the chemical environment typically plays an important role in dictating the chemistry of these systems. As such, the system size required for a meaningful simulation is large. Both the complexity and size of the systems render the majority of the techniques depicted in Figure 1.1 poorly suited to biochemical systems.

The aim of this work is to develop a methodology that sits somewhere in the middle of the categorisation of Figure 1.1. Ideally, we would like to combine the speed of pair potential techniques with the accuracy of density functional theory and wavefunction-based techniques. Density functional theory (DFT) is of such importance to computational and theoretical chemistry that its formulation is given in Appendix A. The methodology outlined will be referred to herein as the Quantum Chemical Topological Force Field (QCTFF). We make no attempt to clarify the jargon that has just been invoked; its meaning will be revealed over the course of the next chapter.

We immediately make a clarification on the nomenclature used throughout this work. Bader[5] has made a useful distinction between the notion of “configuration” and “geometry”. A configuration of a molecular system corresponds to the molecular graph of the system, i.e. the bonding pattern of the system. In contrast, a molecular geometry denotes the positions of the constituent atoms in some basis, Cartesian or other. Whilst the molecular geometry is subject to change with respect to an infinitesimal displacement in atomic positions, the molecular configuration is invariant.

Rather than use the term “geometry”, we choose to use the term “conformation”. We find this term less ambiguous, since geometry can be used for any number of mathematical quantities, whereas the notion of conformation is more restrictive. The idea of a conformation and the sampling of conformations is a predominant theme in this work. We will use $\mathbf{r} \in \mathbb{R}^{3}$ to denote Cartesian three-vectors. $\mathbf{x} \in \mathbb{R}^{a}$, with $a > 3$, will be used for higher-dimensional Cartesian vectors. $\mathbf{q} \in \mathbb{R}^{a}$ will be used for a generalised vector, whose basis vectors are not necessarily orthogonal. We attempt to adhere to this convention throughout.

In Chapter 2, we outline the general principles that will be implemented in the construction of the QCTFF. The mathematical details and derivations are self-contained, and are necessarily rigorous.

In Chapter 3, we outline the importance of conformational sampling of a molecular species for the QCTFF. Further, we introduce a novel conformational sampling methodology that allows the user to conduct a conformational sampling of the physically relevant regions of molecular conformational space.

Chapter 4 is a general proof-of-concept of the QCTFF methodology that has been developed in our lab over the past decade. For the first time, the QCTFF is applied to carbohydrates, and the resulting errors are shown to be well within the limits of “chemical accuracy”.

In Chapter 5, we enumerate a number of problems that have been encountered with the construction of the QCTFF. A variety of remedies are attempted, both from techniques that are available in the literature and a number of novel strategies that have been developed over the past few years.

Finally, Chapter 6 discusses the seemingly unrelated topic of spectroscopic prediction, and introduces a number of novel formalisms for improving the accuracy of these methods. It is hoped that spectroscopic prediction will be a key tool for validating the QCTFF in the near future.

Chapter 2

Exposition

2.1 Molecular Mechanics

2.1.1 Molecular Dynamics

Our starting point is the time-dependent Schrödinger equation (TDSE) for a molecular system comprising $N_e$ electrons and $N_N$ nuclei,

$$ i\hbar \frac{\partial}{\partial t} \Psi(\mathbf{r}, \mathbf{R}; t) = \mathcal{H}\, \Psi(\mathbf{r}, \mathbf{R}; t) , \tag{2.1.1} $$

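where $\Psi$ is the wavefunction of the molecular system, a function of the $3N_e$-dimensional vector, $\mathbf{r}$, containing electronic positions, the $3N_N$-dimensional vector, $\mathbf{R}$, containing nuclear positions, and depends parametrically on time¹. The Hamiltonian operator, $\mathcal{H}$, has standard form,

$$ \mathcal{H} = -\sum_{I} \frac{\hbar^2}{2M_I} \nabla_I^2 - \sum_{i} \frac{\hbar^2}{2m_i} \nabla_i^2 + \sum_{i<j} \frac{q_i q_j}{|\mathbf{r}_i - \mathbf{r}_j|} - \sum_{i,I} \frac{q_i q_I}{|\mathbf{R}_I - \mathbf{r}_i|} + \sum_{I<J} \frac{q_I q_J}{|\mathbf{R}_I - \mathbf{R}_J|} , $$

$$ V_{ne} = \sum_{i<j} \frac{q_i q_j}{|\mathbf{r}_i - \mathbf{r}_j|} - \sum_{i,I} \frac{q_i q_I}{|\mathbf{R}_I - \mathbf{r}_i|} + \sum_{I<J} \frac{q_I q_J}{|\mathbf{R}_I - \mathbf{R}_J|} \tag{2.1.4} $$

¹ To be more precise, $\mathbf{r}$ and $\mathbf{R}$ are functions of time, and therefore correspond to electronic and nuclear trajectories, respectively. The wavefunction is then $\Psi\big(\mathbf{r}(t), \mathbf{R}(t)\big)$.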

By invoking the Born-Oppenheimer approximation, we adopt the product ansatz for the molecular wavefunction[6], whereby the nuclear and electronic wavefunctions are entirely separable,

$$ \Psi(\mathbf{r}, \mathbf{R}; t) = \psi(\mathbf{r}; t)\, \chi(\mathbf{R}; t) . \tag{2.1.6} $$

Substituting (2.1.3) and (2.1.6) into (2.1.1),

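$$ \begin{aligned} i\hbar \frac{\partial}{\partial t} \psi(\mathbf{r}; t)\chi(\mathbf{R}; t) &= \mathcal{H}\, \psi(\mathbf{r}; t)\chi(\mathbf{R}; t) \\ &= -\psi(\mathbf{r}; t) \sum_{I} \frac{\hbar^2}{2M_I} \nabla_I^2 \chi(\mathbf{R}; t) - \chi(\mathbf{R}; t) \sum_{i} \frac{\hbar^2}{2m_i} \nabla_i^2 \psi(\mathbf{r}; t) \\ &\quad + V_{ne}(\mathbf{r}, \mathbf{R})\, \psi(\mathbf{r}; t)\chi(\mathbf{R}; t) . \end{aligned} \tag{2.1.7} $$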

Note that we have eliminated those terms containing $\nabla_I^2 \psi(\mathbf{r}; t)$ and $\nabla_i^2 \chi(\mathbf{R}; t)$ since the nuclear degrees of freedom are independent of the electronic degrees of freedom, and vice versa. Multiplying (2.1.7) through by $\langle\chi|$ and $\langle\psi|$, we obtain the following relations,

$$ i\hbar \frac{\partial}{\partial t} \psi(\mathbf{r}; t) = -\sum_{i} \frac{\hbar^2}{2m_i} \nabla_i^2 \psi(\mathbf{r}; t) + \langle\chi| V_{ne}(\mathbf{r}, \mathbf{R}) |\chi\rangle\, \psi(\mathbf{r}; t) , \tag{2.1.8} $$

$$ i\hbar \frac{\partial}{\partial t} \chi(\mathbf{R}; t) = -\sum_{I} \frac{\hbar^2}{2M_I} \nabla_I^2 \chi(\mathbf{R}; t) + \langle\psi| \mathcal{H}_e(\mathbf{r}, \mathbf{R}) |\psi\rangle\, \chi(\mathbf{R}; t) , \tag{2.1.9} $$

respectively. Iteratively solving these two equations to self-consistency yields a solution to (2.1.1), and is referred to as the time-dependent self-consistent field (TDSCF) method[7]. The first equation, (2.1.8), corresponds to evolving the electronic wavefunction in the presence of a potential resulting from the nuclear degrees of freedom, and the second, (2.1.9), to evolving the nuclear wavefunction in the potential exerted by the electronic degrees of freedom.

The Born-Oppenheimer approximation allows us to assume that the electronic degrees of freedom adjust instantaneously to any displacement in the nuclear degrees of freedom. This assumption comes about because the nuclei are more than three orders of magnitude heavier than electrons, and so adjust more slowly to external fields than their much lighter counterparts. Considering the manifestation of this in the TDSE, the nuclear wavefunction can be represented as a series of delta functions, while the electronic wavefunction is much more diffuse. We can model this by treating the nuclei as classical particles[8], a common approach to which is writing the nuclear wavefunction as

$$ \chi(\mathbf{R}; t) \approx A(\mathbf{R}; t) \exp\left[ \frac{i S(\mathbf{R}; t)}{\hbar} \right] , \tag{2.1.10} $$

where $A(\mathbf{R}; t)$ and $S(\mathbf{R}; t)$ are real functions, and correspond to an amplitude and phase factor, respectively. Inserting the nuclear wavefunction (2.1.10) into (2.1.9), we obtain

$$ i\hbar \frac{\partial}{\partial t} \left( A(\mathbf{R}; t) \exp\left[ \frac{i S(\mathbf{R}; t)}{\hbar} \right] \right) = \left( \langle\psi| \mathcal{H}_e |\psi\rangle - \sum_{I} \frac{\hbar^2}{2M_I} \nabla_I^2 \right) A(\mathbf{R}; t) \exp\left[ \frac{i S(\mathbf{R}; t)}{\hbar} \right] . $$

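After expansion, this expression can be separated into its real and imaginary components,

$$ \frac{\partial S}{\partial t} + \sum_{I} \frac{1}{2M_I} (\nabla_I S)^2 + \langle\psi| \mathcal{H}_e |\psi\rangle = \hbar^2 \sum_{I} \frac{1}{2M_I} \frac{\nabla_I^2 A}{A} \tag{2.1.11} $$

$$ \frac{\partial A^2}{\partial t} + \sum_{I} \nabla_I \cdot \left( A^2\, \frac{\nabla_I S}{M_I} \right) = 0 . \tag{2.1.12} $$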

The first of these, (2.1.11), can be written in a form that is isomorphic to the classical Hamilton-Jacobi equation in the classical limit, i.e. $\hbar \to 0$, leading to

$$ \frac{\partial S}{\partial t} + \sum_{I} \frac{1}{2M_I} (\nabla_I S)^2 + \langle\psi| \mathcal{H}_e |\psi\rangle = 0 . \tag{2.1.13} $$

This is the equation we will be primarily concerned with, but it should be noted that (2.1.12) resembles a continuity equation, since $\langle\chi|\chi\rangle = A^2$ can be interpreted as a nuclear density, as we can verify from (2.1.10). (2.1.11) and (2.1.12) together form the equations of “quantum hydrodynamics”². Noting that the second and third terms of (2.1.13) resemble a kinetic and potential energy, we can combine them into a Hamiltonian, where the momentum conjugate to each $\mathbf{R}_I$ is $\mathbf{P}_I = \nabla_I S$,

$$ \frac{\partial S}{\partial t} + \mathcal{H}(\mathbf{R}, \nabla_I S) = 0 . \tag{2.1.14} $$

Invoking the Newtonian equations of motion, $\dot{\mathbf{P}}_I = -\nabla_I V_e(\mathbf{R}_I)$, where $V_e(\mathbf{R}_I)$ is the potential to which the nuclei are subjected, $\langle\psi| \mathcal{H}_e |\psi\rangle$,

$$ \begin{aligned} \frac{d\mathbf{P}_I}{dt} &= -\nabla_I \langle\psi| \mathcal{H}_e |\psi\rangle \\ M_I \ddot{\mathbf{R}}_I &= -\nabla_I \langle\psi| \mathcal{H}_e |\psi\rangle \\ &= -\nabla_I V_e(\mathbf{R}_I) . \end{aligned} \tag{2.1.15} $$

² As a matter of historical note, the equations of quantum hydrodynamics were derived by Erwin Madelung, after whom they are sometimes named, shortly after Schrödinger’s seminal publication. They also act as a springboard into Bohmian mechanics from the Schrödinger equation.

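Taking the expectation value of $\mathcal{H}_e$ over all electronic degrees of freedom makes $V_e(\mathbf{R}_I)$ a function of the nuclear degrees of freedom. The coupled electronic degrees of freedom are evolved in time by invoking (2.1.8), which in the classical limit, where the nuclei are described by delta functions, becomes

$$ \begin{aligned} i\hbar \frac{\partial}{\partial t} \psi(\mathbf{r}; t) &= -\sum_{i} \frac{\hbar^2}{2m_i} \nabla_i^2 \psi(\mathbf{r}; t) + V_{ne}\, \psi(\mathbf{r}; t) \\ &= \mathcal{H}_e(\mathbf{r}; \mathbf{R})\, \psi(\mathbf{r}; \mathbf{R}) . \end{aligned} \tag{2.1.16} $$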

Here, both $\mathcal{H}_e(\mathbf{r}; \mathbf{R})$ and $\psi(\mathbf{r}; \mathbf{R})$ depend parametrically on the nuclear positions. (2.1.15) and (2.1.16) constitute the TDSCF equations in a form that is amenable to computation. The potential energy function, $V_e(\mathbf{R}_I)$, is often referred to as a potential energy surface (PES), and describes a mapping from the conformational manifold of the nuclear coordinates to the molecular energy.
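To make the classical nuclear equation of motion (2.1.15) concrete, the following is a minimal sketch that propagates a single nuclear coordinate on a model PES. It assumes a one-dimensional harmonic potential in place of the electronic expectation value, reduced units, and the velocity Verlet integrator; none of these choices are taken from this work, and the function names are purely illustrative.

```python
import math

def harmonic_energy(x, k=1.0, x0=0.0):
    """Model PES V(x) = 0.5 * k * (x - x0)**2 (a stand-in for the electronic energy)."""
    return 0.5 * k * (x - x0) ** 2

def harmonic_force(x, k=1.0, x0=0.0):
    """Force -dV/dx on the model PES."""
    return -k * (x - x0)

def velocity_verlet(x, v, mass, force, dt, n_steps):
    """Integrate M * d2x/dt2 = -dV/dx and return the list of (x, v) pairs."""
    trajectory = [(x, v)]
    f = force(x)
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / mass   # half-step velocity update
        x = x + dt * v_half                # full-step position update
        f = force(x)                       # force at the new position
        v = v_half + 0.5 * dt * f / mass   # second half-step velocity update
        trajectory.append((x, v))
    return trajectory

def total_energy(x, v, mass):
    """Kinetic plus potential energy on the model PES."""
    return 0.5 * mass * v * v + harmonic_energy(x)

# Reduced units: mass = 1, k = 1, so the exact solution is x(t) = cos(t).
traj = velocity_verlet(x=1.0, v=0.0, mass=1.0,
                       force=harmonic_force, dt=0.01, n_steps=1000)
x_end, v_end = traj[-1]
print(x_end, total_energy(x_end, v_end, 1.0))
```

In a real simulation the force would come from the gradient of the PES with respect to every nuclear coordinate; the integrator itself is unchanged.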

Behler[9] has given a useful semantic characterisation of the two methods one can implement in approximating the PES. The “physical” approach consists of employing a number of functions that intuitively recreate the behaviour one would expect from a potential. Classical force fields, as we discuss in Section 2.1.2, are one such physical approach. Other examples include embedded atom method (EAM) potentials for metals[10] and bond-order based potentials, like those introduced by Tersoff[11] or Brenner[12], for covalent systems.

The “non-physical” approach entails fitting a number of functionally flexible, but physically irrelevant, functions to ab initio data. Splines are the natural choice for such a task, and have been employed for a number of low-dimensional PESs[13], but their implementation in high-dimensional spaces renders them of little use for the majority of molecular systems[14]. A number of other techniques exist for this purpose, such as Shepard interpolation[15], Gaussian approximation potentials[16], machine learning[17] and genetic programming[18]. We briefly review a few of these methods in Section 2.4.

A third category, which Behler does not allude to, is the use of quantum mechanical methods to construct the PES “on-the-fly”, and perform dynamics directly on the ab initio PES. Naturally, this method is more costly than both the physical and non-physical approaches discussed above, but has proven to be an invaluable tool in the study of molecular dynamics[19, 20]. We review a few of these methods in Section 2.1.3.

2.1.2 Classical Force Fields

A commonly invoked approximation to the molecular PES is the use of a number of simple empirical potentials that are (typically) parameterised from quantum mechanical calculations. Broadly speaking, the PES is split up into two general contributions: the bonded and non-bonded energies, $E_b$ and $E_{nb}$, respectively. $E_{nb}$ is further split into three contributions: the electrostatic $E_{elec}$, van der Waals $E_{vdW}$ and correlation $E_{corr}$ energies.

Given that the PES arises from a quantum mechanical treatment of the molecular system in question, we can immediately see that this splitting is artificial, since such a distinction is not made in the quantum mechanical treatment. However, the distinction between bonded and non-bonded terms allows for an entirely classical modelling of the molecular system, with physically intuitive origins, as we now demonstrate.

Bonded Terms

The bonded contributions to the PES approximately model the energetics arising from atomic valency. Bonded atoms are treated as hard, impenetrable spheres connected by flexible linkages. The energetics of this system can then be modelled classically by expressing the bonded energetics as a power series in the displacements of the system from a minimum energy conformation. Modelling is typically undertaken in an internal coordinate basis, the mathematics of which we rigorously outline in Section 3.5. For the moment, we qualitatively outline the main features of the internal coordinates.

The bonded terms are formed from n-body interactions, where n = 2, 3, 4, subject to there being some bonding network linking the atoms involved in the n-body interaction³. The energy associated with each interaction is then denoted $E_{1-n}(\mathbf{r}_1, \ldots, \mathbf{r}_n)$, where $\mathbf{r}$ is an atomic position vector.

Given two atoms with position vectors $\mathbf{r}_i$ and $\mathbf{r}_j$, we seek a functional form for the interatomic potential, $E_{1-2}(\mathbf{r}_i, \mathbf{r}_j)$. The distance between two such atoms is denoted $r_{ij} = \lVert \mathbf{r}_i - \mathbf{r}_j \rVert$.⁴ The interatomic potential arising between these two bonded atoms is conventionally modelled as a power series, truncated at finite order,

$$ E_{1-2}(\mathbf{r}_i, \mathbf{r}_j) = E(r_0) + k_2^{r} (r_{ij} - r_0)^2 + k_3^{r} (r_{ij} - r_0)^3 + \ldots , \tag{2.1.17} $$

where $E(r_0)$ is the value of the interatomic potential at the minimum energy interatomic distance, $r_0$, and $k_2^{r}, k_3^{r}, \ldots$ are suitably chosen bond “force constants”, which act as expansion coefficients to the quadratic, cubic, etc., terms of the $E_{1-2}(\mathbf{r}_i, \mathbf{r}_j)$ expansion. Truncation of the expansion at the quadratic term yields a harmonic potential for interatomic interactions, while higher order terms offer anharmonic corrections to the harmonic function.

³ Defining this bonding network as “covalent” is not necessarily valid here, since hydrogen bonds and ionic bonds can also be subject to the treatment outlined.

⁴ For the remainder of this work, we will denote all normed quantities as $\lVert \mathbf{a} \rVert_p = (|a_1|^p + \ldots + |a_d|^p)^{1/p}$, where $d$ is the dimensionality of the vector $\mathbf{a}$ and $p$ corresponds to the $L^p$ space upon which we are invoking the norm. For the case where $p = 2$, we obtain the standard Euclidean norm. This will be the default norm invoked over the course of this work, so we drop the subscript and simply refer to this quantity as $\lVert \mathbf{a} \rVert$ for clarity. This is in contrast to the magnitude of a scalar quantity, $a$, which we denote by $|a|$.

A number of difficulties arise with the above model. For instance, it has been found that the force constants are actually dependent upon the interatomic distance[21]. Aside from the dynamical nature of the force constants, one can approximate their values by a number of techniques. The most popular technique appears to be approximating the harmonic force constants from the quantum mechanical Hessian of the system (see for example [22]). On occasion, the resultant force constants are empirically modified to agree with spectroscopic data and band assignments[23]. The order at which the power series should be truncated is not clear-cut; a number of popular force fields simply truncate the power series at the quadratic term, whilst more accurate but less widely used force fields continue to the quartic term[24]. As a final note, we wish to point out that a number of alternative functional forms are available for the modelling of $E_{1-2}(\mathbf{r}_i, \mathbf{r}_j)$, which do not use the power series given in (2.1.17), such as the shifted finitely-extensible non-linear elastic (FENE) potential[25] and the AMOEBA bond potential[26].
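The truncated power series of (2.1.17) is simple to evaluate, as the sketch below illustrates. The function names, the example force constants and the choice of units are hypothetical and are not taken from any particular force field; supplying a single constant gives the harmonic form, and further constants add the anharmonic corrections.

```python
import math

def distance(r_i, r_j):
    """Euclidean distance between two Cartesian position vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(r_i, r_j)))

def bond_energy(r_i, r_j, r0, force_constants, e0=0.0):
    """Evaluate E(r0) + k2*(r_ij - r0)**2 + k3*(r_ij - r0)**3 + ...

    force_constants is the sequence (k2, k3, ...); its length sets the
    order at which the power series is truncated.
    """
    dr = distance(r_i, r_j) - r0
    return e0 + sum(k * dr ** order
                    for order, k in enumerate(force_constants, start=2))

# Illustrative (made-up) parameters: a bond stretched by 0.1 from r0 = 1.0.
harmonic = bond_energy((0.0, 0.0, 0.0), (1.1, 0.0, 0.0), 1.0, (300.0,))
anharmonic = bond_energy((0.0, 0.0, 0.0), (1.1, 0.0, 0.0), 1.0, (300.0, -600.0))
print(harmonic, anharmonic)
```

The angular expansion of the same form can reuse this function body with the bond length replaced by the angle.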

Consider now that the atom at $\mathbf{r}_i$ is involved in 1-2 interactions with atoms at $\mathbf{r}_j$ and $\mathbf{r}_k$. Then, the angle subtended by $\mathbf{r}_j$, $\mathbf{r}_k$ from $\mathbf{r}_i$, $\theta_{ijk}$, can also be used in a power series expansion for the angular energy,

$$ E_{1-3}(\mathbf{r}_i, \mathbf{r}_j, \mathbf{r}_k) = E(\theta_0) + k_2^{\theta} (\theta_{ijk} - \theta_0)^2 + k_3^{\theta} (\theta_{ijk} - \theta_0)^3 + \ldots , \tag{2.1.18} $$

where $E(\theta_0)$ is the value of the angular potential at the minimum energy angle, $\theta_0$, and the force constants $k_2^{\theta}, k_3^{\theta}, \ldots$ are expansion coefficients. Similar to the $E_{1-2}(\mathbf{r}_i, \mathbf{r}_j)$ potentials, the vast majority of force fields truncate the power series at the quadratic term, whilst some more specialist force fields use terms up to the quartic. A number of more complex forms do exist, such as that outlined in [27], where multiplicative polynomials are used to make the potential resemble more of a damped harmonic oscillator. The three-body potential developed for flexible water molecules by Kumagai et al.[28] is an attempt to recreate the vibrational modes of motion for the molecule, and uses a number of additional functions to achieve this. However, much like the case for $E_{1-2}(\mathbf{r}_i, \mathbf{r}_j)$, such expressions are only implemented in highly specialised small molecule force fields, and do not enjoy mainstream implementation.

The final bonded term we consider is the four body potential, where $\mathbf{r}_i$ is linked to $\mathbf{r}_k$, $\mathbf{r}_j$ is linked to $\mathbf{r}_l$, and $\mathbf{r}_i$ is linked to $\mathbf{r}_j$. We are able to define two planes; the first contains the position vectors $\mathbf{r}_i, \mathbf{r}_j, \mathbf{r}_k$, while the second contains the position vectors $\mathbf{r}_i, \mathbf{r}_j, \mathbf{r}_l$. The angle between these two planes when viewed down the vector $\mathbf{r}_i - \mathbf{r}_j$, $\tau_{ijkl}$, is then expanded in a power series to approximate the torsional potential,

$$ E_{1-4}(\mathbf{r}_i, \mathbf{r}_j, \mathbf{r}_k, \mathbf{r}_l) = E(\tau_0) + k_2^{\tau} (\tau_{ijkl} - \tau_0)^2 + k_3^{\tau} (\tau_{ijkl} - \tau_0)^3 + \ldots , \tag{2.1.19} $$

where $E(\tau_0)$ is the value of the torsional potential at the minimum energy torsion angle, $\tau_0$, and the force constants $k_2^{\tau}, k_3^{\tau}, \ldots$ are expansion coefficients. One typically finds, however, that low order power series are inadequate approximations to the torsional potential. The torsional potential is found to be periodic, with energetic barriers easily surmountable over the course of a conventional MD simulation. As such, one is usually required to employ a truncated Fourier series to account for the periodicity[29],

$$ E_{1-4}(\mathbf{r}_i, \mathbf{r}_j, \mathbf{r}_k, \mathbf{r}_l) = \sum_{n} \frac{k_n^{\tau}}{2} \Big[ 1 + \cos(n \tau_{ijkl} + \phi_n) \Big] , \tag{2.1.20} $$

where the $k_n^{\tau}$ are the torsional force constants, $n$ indexes the terms of the Fourier series up to the order at which it is truncated, and $\phi_n$ are phase factors. Few alternative functional forms for $E_{1-4}(\mathbf{r}_i, \mathbf{r}_j, \mathbf{r}_k, \mathbf{r}_l)$ exist; the truncated Fourier series is an excellent function for modelling the energetics of torsional degrees of freedom. It is instead the coupling between torsional degrees of freedom and the electrostatics of a system[30] that can lead to problems in parameterising $E_{1-4}(\mathbf{r}_i, \mathbf{r}_j, \mathbf{r}_k, \mathbf{r}_l)$. Decoupling the two is not feasible in the majority of classical force fields[31], and so is not discussed further here.
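The periodicity that motivates the truncated Fourier series of (2.1.20) can be checked numerically with a short sketch. The function name and the three-fold example parameters below are illustrative assumptions rather than values from any force field.

```python
import math

def torsion_energy(tau, force_constants, phases):
    """Evaluate sum_n (k_n / 2) * [1 + cos(n * tau + phi_n)] for n = 1, 2, ...

    force_constants and phases are equal-length sequences (k_1, k_2, ...)
    and (phi_1, phi_2, ...); their length sets the truncation order.
    """
    return sum(0.5 * k * (1.0 + math.cos(n * tau + phi))
               for n, (k, phi) in enumerate(zip(force_constants, phases), start=1))

# Hypothetical three-fold torsion: only the n = 3 term is non-zero.
k_tau = (0.0, 0.0, 2.0)
phi = (0.0, 0.0, 0.0)

barrier = torsion_energy(0.0, k_tau, phi)            # maximum of the n = 3 term
minimum = torsion_energy(math.pi / 3.0, k_tau, phi)  # minimum of the n = 3 term
print(barrier, minimum)
```

Unlike the polynomial of (2.1.19), this form returns the same energy after a full rotation of the torsion angle, which is the property the text relies upon.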

Electrostatic Energy

Consider the Coulombic interaction between two charged entities in a vacuum, placed at positions $\mathbf{r}_i$ and $\mathbf{r}_j$. The force, $\mathbf{F}_{ij}$, between the two is inversely proportional to the square of the distance separating them, $r_{ij}$, and is given by

$$ \mathbf{F}_{ij} = \frac{q_i q_j}{4\pi\epsilon_0 r_{ij}^2} \frac{(\mathbf{r}_i - \mathbf{r}_j)}{r_{ij}} , \tag{2.1.21} $$

where $q_i$ and $q_j$ are the charges of the respective particles and $\epsilon_0$ is the permittivity of free space. Introducing a polar solvent between the two particles leads to the assimilation of flux lines between the two particles by the solvent, reducing the strength of interaction. This effect can be modelled in an average way by multiplying the permittivity of free space, $\epsilon_0$, by an empirical factor, $\epsilon$, the permittivity of the medium which separates the particles, such that Coulomb’s law is modified to read

$$ \mathbf{F}_{ij} = \frac{q_i q_j}{4\pi\epsilon_0 \epsilon r_{ij}^2} \frac{(\mathbf{r}_i - \mathbf{r}_j)}{r_{ij}} . \tag{2.1.22} $$

Comparison between (2.1.21) and (2.1.22) shows that $r_{ij}^2$ has been scaled to $\epsilon r_{ij}^2$, essentially increasing the effective distance over which the force acts between the two particles. Considering the permittivity of water relative to vacuum is approximately 80, the distance between the two particles is effectively increased by a factor of $\sqrt{80} \approx 9$ when the particles are solvated in water.
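The effective-distance argument can be verified with a few lines of arithmetic. The sketch below evaluates the magnitude of the force in (2.1.21) and (2.1.22) in SI units; the function name is illustrative, and the relative permittivity of 80 for water is the approximate value quoted in the text.

```python
import math

EPS0 = 8.8541878128e-12     # permittivity of free space, F/m
E_CHARGE = 1.602176634e-19  # elementary charge, C

def coulomb_force_magnitude(q_i, q_j, r_ij, eps_r=1.0):
    """Magnitude of the Coulomb force for a relative permittivity eps_r."""
    return abs(q_i * q_j) / (4.0 * math.pi * EPS0 * eps_r * r_ij ** 2)

r = 1.0e-9  # 1 nm separation
f_vacuum = coulomb_force_magnitude(E_CHARGE, -E_CHARGE, r)
f_water = coulomb_force_magnitude(E_CHARGE, -E_CHARGE, r, eps_r=80.0)

# The screened force at r equals the vacuum force at the stretched distance.
f_vacuum_stretched = coulomb_force_magnitude(E_CHARGE, -E_CHARGE, r * math.sqrt(80.0))
print(f_vacuum / f_water, math.sqrt(80.0))
```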

The implementation of so-called “implicit solvation” models is, however, not quite as trivial as one might envisage. For a molecular system, there is a surrounding cavity which cannot be penetrated by other molecules owing to Pauli repulsion. The surface of this cavity is referred to as the “solvent-accessible surface area”, and is typically modelled as a superposition of atom-centred spheres of radius equal to the atomic van der Waals radius. Hence, the dielectric permittivity is only considered to be greater than unity outside of the cavity[32].

A plethora of such implicit solvation models exist, perhaps the most popular being the “Polarisable Continuum Model” (PCM)[33]. Consider a system (and associated cavity) in a dielectric medium of a given permittivity. The electrostatics⁵ associated with the system polarises, and subsequently induces charges in, the surrounding dielectric medium. As a result of these induced charges in the medium, the electrostatics of the system are polarised. Obviously then, this sequence of mutual polarisation between system and dielectric requires iteration to self-consistency, leading to the “Self-Consistent Reaction Field” (SCRF) model.

⁵ We leave the term “electrostatics” purposefully vague; the subtleties will be discussed in Section 2.3.

The energetics associated with this process acts as a perturbation to the isolated system energetics. So, for instance, while a charge-separated species (such as a zwitterion, a molecular system we work with in great detail in Chapter 6) is not stable in the gas phase, the energetics associated with the SCRF stabilises, and permits the existence of, the charge-separated species in a high dielectric solvent (such as water). The “Conductor-Like Screening Model” (COSMO)[34] is another popular implicit solvation model, except here the cavity is embedded in an infinite-dielectric medium (i.e. a perfect conductor). In so doing, there is no iteration to self-consistency since the charges induced in the dielectric are immediately dissipated. Neglecting the iteration to self-consistency leads to a significant increase in computational speed.

The prevalent means for modelling molecular electrostatics is by the use of point charges placed at atomic centres[35, 36]. The simplicity of such a model allows for quick computation of electrostatic interactions. By use of Coulomb’s law, the total electrostatic energy of a molecular system is given by

$$ E_{elec} = \frac{1}{2} \sum_{i \neq j}^{N} \frac{q_i q_j}{4\pi\epsilon_0 r_{ij}} , \tag{2.1.23} $$

where $N$ is the number of atoms, and the factor of a half corrects for double-counting. However, this method is not as simple as it first appears. An atom in a molecule rarely has an integer charge; electrons are free to distribute themselves throughout the molecule, subject to the electronegativity equalisation principle[37]. As such, we are required to use the set of partial charges, $\{q_N\}$, and fit them so as to reproduce the true electrostatic potential. We discuss such a method, which utilises the ab initio molecular electrostatic potential (MEP), $\Phi(\mathbf{r})$, defined by[38]

$$ \Phi(\mathbf{r}) = \sum_{i=1}^{N} \frac{Z_i}{\lVert \mathbf{R}_i - \mathbf{r} \rVert} - \int \frac{\psi^{*}(\mathbf{r}')\psi(\mathbf{r}')}{\lVert \mathbf{r}' - \mathbf{r} \rVert}\, d\mathbf{r}' , \tag{2.1.24} $$

where the first term denotes the potential arising from the $i$th nucleus ($Z_i$ is the corresponding atomic number while $\mathbf{R}_i$ is its position), and the second term is the potential resulting from the electronic distribution of the system. Partial charges are then assigned to each of the $N$ atoms by choosing them such that they reproduce the MEP. It may be readily seen that doing so is equivalent to minimising a sum of squared residuals error function, $E$[39],

$$ E = \sum_{n=1}^{n_{tot}} \left( \sum_{i=1}^{N} \frac{q_i}{\lVert \mathbf{R}_i - \mathbf{r}_n \rVert} - \Phi(\mathbf{r}_n) \right)^{2} , \tag{2.1.25} $$

where $q_i$ is the partial charge associated with the $i$th atom and $n_{tot}$ is the number of external points at which the MEP is sampled.
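Because the residual in (2.1.25) is linear in the charges, the minimisation reduces to a linear least squares problem. The sketch below solves it through the normal equations with a small hand-written Gaussian elimination. Everything here is an illustrative assumption: the three atomic positions, the sampling points, and in particular the reference potential, which is generated from known point charges rather than from an ab initio MEP so that the fit can be checked. No total-charge constraint or restraint is applied.

```python
import math

def design_matrix(atom_positions, grid_points):
    """A[n][i] = 1 / ||R_i - r_n||, the potential of a unit charge on atom i."""
    return [[1.0 / math.dist(R, r) for R in atom_positions] for r in grid_points]

def solve_linear(a, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    aug = [row[:] + [b[k]] for k, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda row: abs(aug[row][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in reversed(range(n)):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x

def fit_charges(atom_positions, grid_points, mep_values):
    """Minimise sum_n (sum_i q_i / ||R_i - r_n|| - Phi(r_n))**2 over the q_i."""
    a = design_matrix(atom_positions, grid_points)
    n_atoms = len(atom_positions)
    ata = [[sum(row[i] * row[j] for row in a) for j in range(n_atoms)]
           for i in range(n_atoms)]
    atb = [sum(row[i] * phi for row, phi in zip(a, mep_values))
           for i in range(n_atoms)]
    return solve_linear(ata, atb)

# Hypothetical three-atom system with a synthetic reference potential.
atoms = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
true_charges = [-0.8, 0.4, 0.4]
grid = [(3.0, 0.0, 0.0), (0.0, 3.0, 0.0), (0.0, 0.0, 3.0), (-3.0, 0.0, 0.0),
        (0.0, -3.0, 0.0), (0.0, 0.0, -3.0), (2.0, 2.0, 2.0), (-2.0, 2.0, -2.0)]
mep = [sum(q / math.dist(R, r) for q, R in zip(true_charges, atoms)) for r in grid]

fitted = fit_charges(atoms, grid, mep)
print(fitted)
```

For larger molecules the normal equations become poorly conditioned, which is one way of seeing the non-uniqueness of fitted charges discussed in the text.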

The simplicity of such a model is, of course, appealing. However, the appeal of computationally efficient models should wane with an increase in computational power. Partial charges have been used in simulation since the advent of in silico chemical modelling, and so one would now assume them to be redundant. Several key problems arise with the usage of partial charges. Analysis of the above function should quickly reveal that this model is only applicable to a static molecule in a single conformation. If a molecule adopts even a slightly different conformation, a new set of partial charges is required to account for electronic redistribution throughout the molecule, resulting in an entirely different MEP.

It seems apparent that as $n_{tot} \to \infty$ in (2.1.25), the set of partial charges should become as optimal as possible in reconstructing the MEP. However, it has been shown[40] that for large values of $N$, no finite value of $n_{tot}$ can produce a unique set of $\{q_N\}$ which give an optimal solution to the least squares problem. Indeed, there are solutions for which certain charges within the set $\{q_N\}$ may be set to zero, needless to say a nonsense situation. Groups in the past have attempted to validate such solutions by the proposition of ‘buried charges’, where certain atoms in a molecule would not contribute to the MEP as they were positioned too deeply within the molecule, resulting in their individual potentials being sequestered by other atomic potentials[41]. However, such a theory is speculative, and is perhaps attributable to the fitting used in deriving the partial charges[42].

A final point that requires discussion is that $E_{elec}$ in the form employed in (2.1.23) is not amenable to computation. The inverse distance dependence means that $E_{elec}$ decays very slowly, so a great many interactions require evaluation if one is to capture all meaningful contributions to the electrostatic energy. One can in principle employ a “cutoff distance”, where $E_{elec}$ is evaluated only if $r_{ij} < r_{cutoff}$. However, the use of a cutoff distance is inadvisable, since major artifacts are found to arise over the course of a simulation[43, 44].
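A direct evaluation of (2.1.23), with an optional cutoff, can be sketched as follows. The sum runs over unique pairs, which is equivalent to the half-weighted sum over $i \neq j$. Reduced units with $1/(4\pi\epsilon_0) = 1$ are assumed, and the three-charge example is made up solely to show how a cutoff silently discards a contribution.

```python
import math
from itertools import combinations

def coulomb_energy(charges, positions, cutoff=None, coulomb_constant=1.0):
    """Pairwise point-charge energy; pairs with r_ij >= cutoff are skipped."""
    energy = 0.0
    for i, j in combinations(range(len(charges)), 2):
        r_ij = math.dist(positions[i], positions[j])
        if cutoff is not None and r_ij >= cutoff:
            continue  # truncated interaction: this is the source of cutoff artifacts
        energy += coulomb_constant * charges[i] * charges[j] / r_ij
    return energy

# Hypothetical collinear charges at x = 0, 1 and 3 (reduced units).
charges = [1.0, -1.0, 1.0]
positions = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (3.0, 0.0, 0.0)]

full = coulomb_energy(charges, positions)
truncated = coulomb_energy(charges, positions, cutoff=2.5)
print(full, truncated)
```

The cost of the full double loop grows quadratically with the number of atoms, which is the practical reason cutoffs are tempting despite the artifacts noted above.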

The matter is complicated further when periodic boundary conditions are employed, where (2.1.23) must be modified to sum over periodic images of the simulation cell,

$$ E_{elec} = \frac{1}{8\pi\epsilon_0} \sum_{\mathbf{n}}^{\prime} \sum_{i,j}^{N} \frac{q_i q_j}{\lVert \mathbf{r}_{ij} + L\mathbf{n} \rVert} , \tag{2.1.26} $$

where $\mathbf{r}_{ij} = \mathbf{r}_i - \mathbf{r}_j$ is the interatomic vector, the sum over $\mathbf{n}$ is a sum over all periodic box replica lattice vectors, $L$ is the length of the box and the primed summation implies we omit the $i = j$ term when $\mathbf{n} = \mathbf{0}$. This sum is conditionally convergent, and diverges when the system is not electrically neutral[45].

The Ewald summation[46], and its various incarnations, have been employed to split the above Coulombic interaction into a sum of a short-ranged real space term and a long-ranged reciprocal space term, both of which are absolutely convergent. We present the result without derivation,

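$$ E_{elec} = \frac{1}{8\pi\epsilon_0} \sum_{i,j}^{N} q_i q_j \left[ \sum_{\mathbf{n}}^{\prime} \frac{\mathrm{erfc}(\alpha \lVert \mathbf{r}_{ij} + \mathbf{n} \rVert)}{\lVert \mathbf{r}_{ij} + \mathbf{n} \rVert} + \frac{1}{\pi L^3} \sum_{\mathbf{k} \neq \mathbf{0}} \frac{4\pi^2}{\lVert \mathbf{k} \rVert^2} \exp\left( -\frac{\pi^2 \lVert \mathbf{k} \rVert^2}{\alpha^2} \right) \cos(\mathbf{k} \cdot \mathbf{r}_{ij}) \right] , \tag{2.1.27} $$

where $\alpha$ is a damping factor, controlling the convergence of the Ewald summation, and $\mathbf{k}$ is a reciprocal vector, equal to $2\pi\mathbf{n}/L^2$[47]. Much literature is available on the Ewald summation and its computational complexity, and the reader is directed to several excellent reviews for a more thorough discussion[48, 49].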
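The real space term of the Ewald sum rests on the identity $1/r = \mathrm{erfc}(\alpha r)/r + \mathrm{erf}(\alpha r)/r$, in which the first part decays rapidly and the second is smooth enough to be summed in reciprocal space. The sketch below only illustrates this splitting for a single pair distance; it is not an implementation of the full Ewald summation, and the function name and the value of the damping factor are illustrative assumptions.

```python
import math

def split_coulomb(r, alpha):
    """Split 1/r into a short-ranged erfc part and a long-ranged erf part."""
    short_range = math.erfc(alpha * r) / r
    long_range = math.erf(alpha * r) / r
    return short_range, long_range

alpha = 1.0  # hypothetical damping factor, reduced units
for r in (0.5, 1.0, 2.0, 4.0):
    short_range, long_range = split_coulomb(r, alpha)
    # The two parts always recombine to the bare Coulomb kernel 1/r.
    print(r, short_range, long_range, short_range + long_range)
```

Increasing the damping factor shortens the range of the real space part at the expense of requiring more reciprocal vectors, which is the trade-off the factor controls.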

van der Waals and Correlation Energies

The van der Waals and correlation energies are best explained by first introducing a popular potential for their modelling. The Lennard-Jones function[50], $V^{LJ}$, is one such potential, and is given by

$$ E_{vdW}(\mathbf{r}_i, \mathbf{r}_j) + E_{corr}(\mathbf{r}_i, \mathbf{r}_j) = V_{ij}^{LJ} = 4\epsilon_{ij} \left[ \left( \frac{\sigma_{ij}}{r_{ij}} \right)^{12} - \left( \frac{\sigma_{ij}}{r_{ij}} \right)^{6} \right] , \tag{2.1.28} $$

where $r_{ij}$ is the distance between the atoms at $\mathbf{r}_i$ and $\mathbf{r}_j$, $\sigma_{ij}$ is the distance at which the inter-particle interaction is equal to zero, and $\epsilon_{ij}$ is the depth of the well at the lowest energy point on $V_{ij}^{LJ}$. It is worth briefly mentioning that both the parameters $\sigma$ and $\epsilon$ are defined for atoms, and mixing rules must be invoked to obtain the quantities $\sigma_{ij}$ and $\epsilon_{ij}$. These mixing rules are entirely empirical, and are not based upon a physical theory. A number exist (see [51] for an enumeration), but we do not detail them further here.

The $r_{ij}^{-12}$ term is repulsive, and dominates the short-range portion of $V_{ij}^{LJ}$, ensuring that particles never come too close to one another. The theoretical basis for this term originates from the quantum mechanical exchange interaction (Pauli repulsion), in that fermions (the electrons) cannot occupy the same quantum states. While one is unable to derive an $r_{ij}^{-12}$ functional form from theory, it is suitable for preventing particles from approaching one another, and is also computationally ideal since one can compute it by squaring the $r_{ij}^{-6}$ term, which we now describe.

In contrast, the $r_{ij}^{-6}$ term is attractive everywhere. Unlike the repulsion term, this “dispersion” term is theoretically grounded, and derives from the van der Waals interactions between two particles. Atoms and molecules possess dipole moments that are either permanent or induced. The permanent dipole moments are features of a higher order treatment of the multipole moments of a system, as we describe in Section 2.3. The induced dipole moments are a result of the molecular electrostatics being conformationally dependent; as atoms reconfigure themselves, electronic distributions, and consequently multipole moments, change. However, no classical force field currently allows for a conformationally dependent treatment of electrostatics. As such, the interactions involving induced dipole moments are introduced as “bolt-on” terms.

The van der Waals interactions comprise three distinct components: (1) Keesom interactions, Permanent dipole - Permanent dipole (“Orientation”); (2) Debye interactions, Permanent dipole - Induced dipole (“Induction”); and (3) London interactions, Induced dipole - Induced dipole (“Dispersion”). One can show that each of these terms combined leads to an $r_{ij}^{-6}$ law for the potential energy, which we do not derive in full here since the derivation is lengthy. The reader is, however, directed to the excellent review of van der Waals interactions for a complete mathematical derivation[52].
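The Lennard-Jones function of (2.1.28) is simple to evaluate, and a minimal sketch is given below. It forms the repulsive factor by squaring the dispersion factor, so that only one power is computed. The function name and the reduced units ($\sigma = \epsilon = 1$) in the usage lines are our own choices, made purely for illustration.

```python
def lennard_jones(r, sigma, epsilon):
    """Lennard-Jones pair energy of (2.1.28) at separation r."""
    sr6 = (sigma / r) ** 6   # dispersion factor (sigma / r)^6
    sr12 = sr6 * sr6         # repulsive factor obtained by squaring sr6
    return 4.0 * epsilon * (sr12 - sr6)

# In reduced units the energy vanishes at r = sigma and reaches its
# minimum value of -epsilon at r = 2^(1/6) * sigma.
print(lennard_jones(1.0, sigma=1.0, epsilon=1.0))
print(lennard_jones(2.0 ** (1.0 / 6.0), sigma=1.0, epsilon=1.0))
```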

Alternative functional forms are available beyond the Lennard-Jones potential. The majority fall into one of two categories. The first category comprises the Lennard-Jones-like potentials, which use two independent inverse distance terms to describe the repulsion and dispersion interactions. Examples include the shifted force n-m potential[53], the shifted Weeks-Chandler-Andersen potential[54] and the AMOEBA 14-7 pair potential[26]. The second category employs exponential functions, such as the Morse, Buckingham and Born-Huggins-Meyer potentials[55].

2.1.3 ab initio Molecular Dynamics

Let us consider the computation of $\langle\psi| H_e |\psi\rangle$ by expanding the electronic wavefunction into a suitably chosen basis set. For our purposes, we will find it convenient to use the eigenfunctions of the electronic Hamiltonian, $\{\phi_k(\mathbf{r}; \mathbf{R})\}$, or “adiabatic basis”,

$$\psi(\mathbf{r}; t) = \sum_{k=0}^{\infty} c_k(t) \phi_k(\mathbf{r}; \mathbf{R}) , \tag{2.1.29}$$

where $c_k(t)$ is a time-dependent complex coefficient, and the constraint $\sum_k |c_k(t)|^2 = 1$ is required to ensure the wavefunction is normalised. We also note that the adiabatic basis functions are orthonormal, i.e. $\langle\phi_i|\phi_j\rangle = \delta_{ij}$, the Kronecker delta. Invoking the TDSE of (2.1.1),

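$$\begin{aligned}
i\hbar \frac{\partial}{\partial t} \psi(\mathbf{r}; t) &= H_e \psi(\mathbf{r}; t) \\
i\hbar \frac{\partial}{\partial t} \sum_{k=0}^{\infty} c_k(t) \phi_k(\mathbf{r}; \mathbf{R}) &= \sum_{k=0}^{\infty} H_e c_k(t) \phi_k(\mathbf{r}; \mathbf{R}) \\
i\hbar \sum_{k=0}^{\infty} \left[ \phi_k(\mathbf{r}; \mathbf{R}) \dot{c}_k(t) + c_k(t) \dot{\phi}_k(\mathbf{r}; \mathbf{R}) \right] &= \sum_{k=0}^{\infty} c_k(t) E_k \phi_k(\mathbf{r}; \mathbf{R}) ,
\end{aligned} \tag{2.1.30}$$

where $E_k$ is the energy associated with eigenfunction $\phi_k(\mathbf{r}; \mathbf{R})$, and we have used the dot notation in the conventional sense of denoting a (partial) temporal derivative. Multiplying through by $\langle\phi_m|$,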

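$$\begin{aligned}
i\hbar \sum_{k=0}^{\infty} \left[ \langle\phi_m|\phi_k\rangle \dot{c}_k(t) + c_k(t) \langle\phi_m|\dot{\phi}_k\rangle \right] &= \sum_{k=0}^{\infty} c_k(t) E_k \langle\phi_m|\phi_k\rangle \\
i\hbar \left[ \dot{c}_m(t) + \sum_{k=0}^{\infty} c_k(t) \langle\phi_m|\dot{\phi}_k\rangle \right] &= c_m(t) E_m .
\end{aligned} \tag{2.1.31}$$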

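Writing the above expression as the time evolution of $c_m(t)$, we have[56]

$$i\hbar \dot{c}_m(t) = c_m(t) E_m - i\hbar \sum_{k=0}^{\infty} c_k(t) \langle\phi_m|\dot{\phi}_k\rangle . \tag{2.1.32}$$

We can use the chain rule to rewrite the time derivative of the electronic eigenfunctions, which depend on time only through the nuclear positions $\mathbf{R}_I$,

$$\frac{\partial}{\partial t} \phi_k = \sum_I \nabla_I \phi_k \cdot \frac{\partial \mathbf{R}_I}{\partial t} = \sum_I \dot{\mathbf{R}}_I \cdot \nabla_I \phi_k ,$$

and we rewrite (2.1.32) as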

$$i\hbar \dot{c}_m(t) = c_m(t) E_m - i\hbar \sum_{k=0}^{\infty} c_k(t) \sum_I \dot{\mathbf{R}}_I \cdot \langle\phi_m|\nabla_I \phi_k\rangle . \tag{2.1.33}$$

The coupled equations of motion using the above results can then be formally given in terms of the expansion coefficients

$$M_I \ddot{\mathbf{R}}_I = -\sum_{k=0}^{\infty} |c_k(t)|^2 \nabla_I E_k - \sum_{k,m} c_k^*(t) c_m(t) (E_k - E_m) \langle\phi_k|\nabla_I \phi_m\rangle , \tag{2.1.34}$$

$$i\hbar \dot{c}_m(t) = c_m(t) E_m - i\hbar \sum_{k=0}^{\infty} c_k(t) \sum_I \dot{\mathbf{R}}_I \cdot \langle\phi_m|\nabla_I \phi_k\rangle , \tag{2.1.35}$$

where the first equation can be trivially obtained through expanding $\langle\psi_k| H_e |\psi_m\rangle$. This formalism is referred to as Ehrenfest Molecular Dynamics, and includes non-adiabatic transitions between electronic eigenfunctions[57]. When the electronic wavefunction exists in a superposition of quantum states, i.e. more than one coefficient is non-zero, the PES exists as a linear combination of contributions from the relevant eigenfunctions, and the nuclei are said to evolve classically upon the “mean-field” PES[58].
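The mean-field character of the PES can be made concrete for a truncated basis. Because the adiabatic functions are orthonormal eigenfunctions of $H_e$, the expectation $\langle\psi| H_e |\psi\rangle$ reduces to a population-weighted sum of the eigenvalues $E_k$. The sketch below evaluates this sum; the coefficients and energies in the usage lines are hypothetical values chosen only to exercise the function.

```python
import math

def mean_field_energy(coefficients, energies, tol=1e-8):
    """Return sum_k |c_k|^2 E_k for a normalised set of expansion coefficients."""
    norm = sum(abs(c) ** 2 for c in coefficients)
    if abs(norm - 1.0) > tol:
        raise ValueError("coefficients must satisfy sum_k |c_k|^2 = 1")
    return sum(abs(c) ** 2 * e for c, e in zip(coefficients, energies))

# Hypothetical two-state superposition with populations 0.75 and 0.25.
c = [math.sqrt(0.75), 0.5j]
E = [-1.0, -0.5]
print(mean_field_energy(c, E))
```

When only one coefficient is non-zero the function returns the corresponding $E_k$, i.e. the nuclei move on a single adiabatic surface.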

One can immediately appreciate the computational effort required to use Ehrenfest MD. Since both the electronic wavefunction and nuclei require numerical integration, the user is limited to a dynamical timestep which allows appropriate integration of the electronic degrees of freedom. Since the electrons move far quicker than nuclei, this timestep is orders of magnitude smaller than the conventional timestep one is permitted to use in classical MD, and the process is highly time-consuming for chemically relevant timescales[8]. Naturally, one also requires a finite basis to replace the infinite summations over basis states to make the problem computationally tractable.

One can also conceive of a contrasting form of dynamics, where a dynamical timestep suitable for the nuclear degrees of freedom can be used at the expense of solving the time-independent Schrödinger equation at each step. This approach is referred to as Born-Oppenheimer MD. However, a slight caveat is also introduced, in that dynamics can only be performed on the PES corresponding to a single basis state at any one time, i.e. there is no mean-field variant of Born-Oppenheimer MD where basis states are allowed to interfere. Of course, one can perform dynamics on the PES arising from an excited state, but non-adiabatic transitions cannot be modelled. We constrain our discussion to the ground state eigenfunction, $\phi_0$.

A number of quantum chemical software packages possess some form of Born-Oppenheimer MD functionality, typically in the form of Atom Centred Density Matrix Propagation (ADMP)[59, 60], which has equivalent functionality to Born-Oppenheimer MD, but with a great reduction in computational cost. The coupled equations of motion for Born-Oppenheimer MD are

$$M_I \ddot{\mathbf{R}}_I = -\nabla_I \min_{\phi_0} \left\{ \langle\phi_0| H_e |\phi_0\rangle \right\} \tag{2.1.36}$$

$$H_e \phi_0 = E_0 \phi_0 \tag{2.1.37}$$

At every dynamical step, a minimisation of the expectation of the electronic Hamiltonian with respect to the basis state $\phi_0$ is required. The variation of $\phi_0$ such that $\langle\phi_0| H_e |\phi_0\rangle$ is minimised can be solved by the conventional quantum chemical techniques, such as the Hartree-Fock or density functional theories[61].
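The structure of (2.1.36) and (2.1.37) can be sketched as a loop in which the ground state energy is re-evaluated at every nuclear geometry and its gradient drives a classical integrator. The sketch below makes two loud assumptions: the electronic minimisation is replaced by a one-dimensional harmonic model function, `ground_state_energy`, and the nuclei are propagated with the velocity Verlet scheme, which is our choice of integrator rather than anything prescribed above. All names and parameter values are hypothetical.

```python
def ground_state_energy(x):
    """Stand-in for the electronic minimisation: a harmonic model PES (assumption)."""
    k, x0 = 1.0, 0.0
    return 0.5 * k * (x - x0) ** 2

def force(x, h=1e-5):
    """Nuclear force -dE0/dx from a central finite difference of the model PES."""
    return -(ground_state_energy(x + h) - ground_state_energy(x - h)) / (2.0 * h)

def born_oppenheimer_md(x, v, mass, dt, n_steps):
    """Velocity Verlet propagation on the ground state surface."""
    f = force(x)
    trajectory = [(x, v)]
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / mass
        x = x + dt * v_half
        f = force(x)  # the "electronic problem" is re-solved at the new geometry
        v = v_half + 0.5 * dt * f / mass
        trajectory.append((x, v))
    return trajectory

traj = born_oppenheimer_md(x=1.0, v=0.0, mass=1.0, dt=0.01, n_steps=1000)
x_end, v_end = traj[-1]
# Total energy should stay close to its initial value of 0.5 for this model.
print(0.5 * v_end ** 2 + ground_state_energy(x_end))
```

In a real calculation each call to `force` would hide a full self-consistent field calculation, which is where the cost of the method lies.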

It is impossible to systematically outline every type of ab initio molecular dynamics that is available. However, we make special mention of the path integral MD (PIMD)[62] and Car-Parrinello[63] approaches. PIMD employs the fact that quantum mechanical path integrals of a system are isomorphic to a ring polymer, whose dynamics are amenable to computational treatment. The Car-Parrinello approach is arguably the most successful form of ab initio MD, and builds upon the Born-Oppenheimer treatment discussed above. However, as opposed to continually solving the electronic Schrödinger equation and updating the nuclear positions in the resultant field, the Car-Parrinello scheme includes electronic dynamics as fictitious degrees of freedom in an extended Lagrangian.

2.2 Atomic Partitioning

The finite nature of matter has been a topic of scrutiny since ancient times. The musings of Democritus, Empedocles and Anaxagoras, while fascinating, are superfluous to our current discussion. The inherent wave-nature of electrons prevents their localisation to points in space. Instead, an electron is described by a wavefunction, $\psi(\mathbf{r})$, where $\mathbf{r} \in \mathbb{R}^3$. The Born rule allows one to characterise the expectation value of an observable, represented by an operator, $O$, as

$$\langle O \rangle = \int_{\mathbb{R}^3} d\mathbf{r}\, \psi^*(\mathbf{r}) O \psi(\mathbf{r}) . \tag{2.2.1}$$

The integral over all space is problematic when one wishes to assign an observable to an atom within a molecule. Rather, it would be preferable to integrate over some atomic domain, and rigorously demonstrate that the resultant expectation of that observable corresponds to an atomic property.

Several methods have been developed which allow for the rigorous partitioning of an “atom in a molecule”. Such partitioning schemes allow for the electron density to be compartmentalised into finite domains. These domains can subsequently be integrated over to yield electronic properties associated with an atom in a molecule. For example, integration of the electron density, $\rho(\mathbf{r})$, over an atomic domain, $\Omega$, characterises the electronic population associated with that atom,

$$q(\Omega) = \int_{\Omega} d\mathbf{r}\, \rho(\mathbf{r}) . \tag{2.2.2}$$

Note that if one wished to compute the atomic charge associated with the atom, the electronic population given by the above equation would be subtracted from the nuclear charge. In the following, we give a brief outline of a number of popular methods for characterising an atom in a molecule.
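As a simple illustration of (2.2.2), the sketch below integrates a density over a spherical domain. We assume the hydrogen 1s density in atomic units, $\rho(r) = e^{-2r}/\pi$, and a sphere centred on the nucleus, for which the population is known in closed form. The domain, the midpoint quadrature and the function names are our own choices and are not tied to any particular partitioning scheme.

```python
import math

def density_1s(r):
    """Hydrogen 1s electron density in atomic units (assumed model density)."""
    return math.exp(-2.0 * r) / math.pi

def population_in_sphere(radius, n=20000):
    """Midpoint-rule integral of 4 * pi * r^2 * rho(r) from 0 to radius."""
    dr = radius / n
    total = 0.0
    for i in range(n):
        r = (i + 0.5) * dr
        total += 4.0 * math.pi * r * r * density_1s(r) * dr
    return total

def analytic_population(radius):
    """Closed form of the same integral: 1 - exp(-2R) * (1 + 2R + 2R^2)."""
    return 1.0 - math.exp(-2.0 * radius) * (1.0 + 2.0 * radius + 2.0 * radius ** 2)

for R in (1.0, 2.0, 20.0):
    print(R, population_in_sphere(R), analytic_population(R))
```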

2.2.1 Hirshfeld Partitioning

We can model the electron density of a molecule comprising N atoms as a linear combination of N spherically symmetric nuclear-centred atomic electron densities,

$$\rho^{\text{pro}}(\mathbf{r}) = \sum_{i=1}^{N} \rho_i^{\text{at}}(\mathbf{r}) , \tag{2.2.3}$$

where $\rho_i^{\text{at}}(\mathbf{r})$ are the individual atomic electron density functions, and we have introduced the notion of a “promolecule” density, $\rho^{\text{pro}}(\mathbf{r})$[64]. A scalar field for each atom, termed the sharing function, is introduced,

$$w_i(\mathbf{r}) = \frac{\rho_i^{\text{at}}(\mathbf{r})}{\rho^{\text{pro}}(\mathbf{r})} , \tag{2.2.4}$$

denoting the proportion of the electron density at $\mathbf{r}$ belonging to the $i$th atom. Naturally the sum of weighting functions for a given $\mathbf{r}$ is normalised, i.e. $\sum_i w_i(\mathbf{r}) = 1$. Now, if one possesses the true$^6$ electron density of the molecular system, $\rho(\mathbf{r})$, atomic electron densities can be constructed by use of the sharing function,

$$\rho_i(\mathbf{r}) = w_i(\mathbf{r}) \rho(\mathbf{r}) , \tag{2.2.5}$$

where $\rho_i(\mathbf{r})$ denotes the “true” electron density function associated with the $i$th atom. Atomic properties, such as the charge associated with an atom, are then defined by

$$q_i = -\int d\mathbf{r}\, \rho_i(\mathbf{r}) . \tag{2.2.6}$$
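The construction of (2.2.3) to (2.2.5) can be sketched in one dimension. We assume two Gaussian pro-atom densities and a hypothetical “molecular” density in which some population has been shifted from atom A to atom B; none of these functions or numbers come from a real calculation. The sketch returns the electronic populations $\int w_i \rho$, which sum to the total population because the sharing functions sum to one at every point.

```python
import math

def gaussian(x, centre, width, population):
    """Normalised Gaussian scaled to integrate to `population`."""
    norm = population / (width * math.sqrt(2.0 * math.pi))
    return norm * math.exp(-0.5 * ((x - centre) / width) ** 2)

# Hypothetical pro-atom densities, one electron each.
proatoms = [
    lambda x: gaussian(x, -0.7, 0.5, 1.0),  # atom A
    lambda x: gaussian(x, +0.7, 0.5, 1.0),  # atom B
]

def molecular_density(x):
    """Hypothetical 'true' density: population shifted from A towards B."""
    return gaussian(x, -0.7, 0.5, 0.8) + gaussian(x, +0.7, 0.5, 1.2)

def hirshfeld_populations(rho, proatoms, grid, dx):
    """Integrate w_i(x) * rho(x) over the grid for every atom i."""
    populations = [0.0] * len(proatoms)
    for x in grid:
        pro = [p(x) for p in proatoms]
        promolecule = sum(pro)
        if promolecule == 0.0:
            continue  # no pro-atom density here, so no sharing function is defined
        for i, p_i in enumerate(pro):
            populations[i] += (p_i / promolecule) * rho(x) * dx
    return populations

dx = 0.01
grid = [-8.0 + (i + 0.5) * dx for i in range(1600)]
pops = hirshfeld_populations(molecular_density, proatoms, grid, dx)
print(pops, sum(pops))
```

Note that both atoms receive density from every grid point, which is the overlapping character of the domains.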

We can immediately take issue with the Hirshfeld partitioning scheme by noting that the partitioning does not define finite atomic domains; the atomic domains are said to be overlapping. Atomic electron densities are delocalised throughout all space, so we conclude that atomic properties are dictated by the system in which the atom is situated. This conclusion is in contrast to the notion of a functional group, where certain atomic properties are conserved regardless of the system[65]. We take the example of lithium fluoride and lithium hydride. Contour diagrams of the molecular electron density reveal the similarities in the electron density of lithium. In spite of this, Hirshfeld partitioning predicts atomic charges of +0.59 and +0.41 for the lithium atom in LiF and LiH, respectively[66].

Independent of the criticism we have just outlined, Parr and Nalewajski[67] have used information theory to show that the Hirshfeld partitioning corresponds to a maximal conservation of information by isolated atoms. Later work has shown this result to be completely dependent upon the entropy function used to measure the information content of a system, and the work of Parr and Nalewajski to be valid only for logarithmic entropy measures[68].

$^6$ We use the word “true” loosely. By true electron density, we mean an electron density as obtained from experimental or ab initio data.

Owing to its simplicity, the Hirshfeld partitioning scheme has been used extensively for deriving a number of atomic properties. For example, the Fukui function[69], a quantity arising in conceptual DFT to probe the reactivity of a system when electrons are either added or removed, has been localised to atoms in a molecule. The computed atomic Fukui functions correlated well with reaction site selectivities as found from experimental studies[70].

Whilst initially formulated to yield atomic properties, Hirshfeld partitioning is easily extensible to molecular properties. Water molecule polarisabilities from clusters[71] and the partitioning of crystalline electron densities into molecular electron densities[72] have been shown to be amenable to such a treatment.

2.2.2 Bader’s Atoms in Molecules

The development of molecular orbital theory largely caused a decline in chemical understanding of the theoretical description of molecular systems. The fact that each electron occupies a molecular orbital dispersed across the spatial extent of the molecule gave rise to a valid query: why do functional groups impart some property, such as reactivity, to the molecule, when the distribution of electrons throughout the system is essentially no different to the “un-functional$^7$” molecule? Surely there must be some localisation of electrons to the functional group, which permits subsequent functionality. If this is not the case, then even the most fundamental chemical concepts, such as those of nucleophiles and electrophiles, have no theoretical grounding. These terms are used to denote a property of an atom in a molecule, which is not recovered by molecular orbital theory. For example, if one assesses butanol by means of molecular orbital theory, the electrons ‘belonging’ to the hydroxyl group are dispersed throughout the entirety of the molecule. If this is truly the case, then it becomes particularly problematic when one attempts to explain why the presence of the functional group imparts reactivity (an electronic phenomenon), when the electrons are not localised.

$^7$ We use “un-functional” as a loose term, denoting the molecular species lacking the functional group.

Bader[5] has addressed this problem by suggesting that atoms are open systems within a molecule, partitioned from one another by boundary surfaces that are recovered by the Laplacian of the molecular electron density. In this way, one is able to unambiguously define an “atom in a molecule”.

The gradient of a scalar field is a vector field, the vectors of which are directed along the path of greatest increase in the scalar field. As such, the vectors which define the field ∇ρ(r) point towards the greatest increase in the scalar field ρ(r). By simple vector calculus, one is able to show that the vectors ∇ρ(r) orthogonally intersect constant electron density isosurfaces[73].

Points in the vector field satisfying ∇ρ(r) = 0 are termed critical points, and represent a maximum, minimum or point of inflexion within the scalar field ρ(r). The identity of each critical point is revealed by assessing the curvature of ρ(r) at each point, i.e. calculation of the Hessian of ρ(r), H(ρ).

Since H(ρ) is real and symmetric (Hermitian), it may be diagonalised by its eigenvectors, the principal axes of curvature. The resultant eigenvalues correspond to the values of the curvature along each of the principal axes. The nature of the critical point in question is then given by two easily evaluated parameters: the rank (ω) and signature (σ) of the critical point, where the former is defined as the number of non-zero eigenvalues of H(ρ), and the latter as the sum of the signs of the eigenvalues. As such, each critical point is fully characterised by (ω, σ), all variations being summarised below[74].

| (ω, σ) | Description | Corresponds to |
|---|---|---|
| (3, −3) | Maximum in all three principal axes of curvature. | Maximum in ρ(r). Found primarily at nuclear positions (nuclear attractors). |
| (3, −1) | Maximum in two principal axes but minimum along third axis. | Found at inter-nuclear positions between nuclei (bond critical points). |
| (3, +1) | Minimum in two principal axes but a maximum along the third. | Found within the centre of ring structures (ring critical point). |
| (3, +3) | Minimum in all three principal axes of curvature. | Found within the centre of cage structures (cage critical point). |
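The classification above amounts to counting and summing signs. The sketch below takes the three eigenvalues of H(ρ) at a critical point, assumed to have been obtained by a prior diagonalisation, and returns (ω, σ) together with the corresponding label. The tolerance used to decide whether an eigenvalue is zero, and the eigenvalues in the usage loop, are hypothetical.

```python
def classify_critical_point(curvatures, tol=1e-8):
    """Return (rank, signature) from the eigenvalues of the Hessian of rho."""
    nonzero = [c for c in curvatures if abs(c) > tol]
    rank = len(nonzero)
    signature = sum(1 if c > 0.0 else -1 for c in nonzero)
    return rank, signature

CRITICAL_POINT_NAMES = {
    (3, -3): "nuclear attractor",
    (3, -1): "bond critical point",
    (3, +1): "ring critical point",
    (3, +3): "cage critical point",
}

# Hypothetical eigenvalues, chosen only to exercise each row of the table.
examples = [(-1.2, -0.8, -0.3), (-0.5, -0.4, 0.9), (-0.2, 0.3, 0.6), (0.1, 0.2, 0.4)]
for eigenvalues in examples:
    label = classify_critical_point(eigenvalues)
    print(label, CRITICAL_POINT_NAMES.get(label, "degenerate (rank < 3)"))
```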

A fundamental result in the topology of the electron density, of great importance in the following, is the partitioning of a molecular system into topological atoms. A key feature necessary to achieve this result is the gradient path. An easy way to grasp what this is, is to think of a succession of very short gradient vectors, one after the other and constantly changing direction. In the limit of infinitesimally short gradient vectors, one obtains a smooth and (in general) curved path, which is the gradient path. A gradient path always originates at a critical point and terminates at another critical point. Bundles of gradient paths form a topological object depending on the signature of the critical points that the object connects. All possibilities have been exhaustively discussed before[75] but three ubiquitous possibilities are specified as follows:

1. The topological atom is a bundle of gradient paths originating at infinity and terminating at the nucleus.

2. The bond path (or more generally atomic interaction line) is the set of two gradient paths, each originating at a bond critical point and terminating at a different nucleus.

3. The interatomic surface (IAS), which is a bundle of gradient paths originating at infinity and terminating at a bond critical point.

For IASs only, the following condition is met

$$\nabla\rho(\mathbf{r}) \cdot \mathbf{n}(\mathbf{r}) = 0 , \tag{2.2.7}$$

where $\mathbf{n}(\mathbf{r})$ is defined as the vector normal to the IAS. By finding all surfaces which obey this condition, the molecule is completely partitioned into distinct atomic volumes, or “atomic basins”, $\Omega_i$, where the subscript denotes the atomic basin associated with the $i$th atom in a molecule. All key topological features in $\nabla\rho(\mathbf{r})$ are summarised in Figure 2.1.

Figure 2.1: A contour plot of the electron density in the molecular plane of furan superimposed onto a representative collection of gradient paths. Atoms are represented by black circles, where the gradient paths terminate. Interatomic surfaces are highlighted as solid curves, and contain bond critical points (black squares). A ring critical point (triangle) is also shown in the centre of the furan.

Integration over these atomic basins allows atomic properties $P_f(\Omega_i)$ to be defined and calculated. The universal formula from which all atomic properties can be calculated is

$$P_f(\Omega_i) = \int_{\Omega_i} d\tau\, f(\mathbf{r}) , \tag{2.2.8}$$

where integration with respect to $d\tau$ denotes a triple integration over all three Cartesian coordinates, confined to the atomic volume $\Omega_i$, and $f(\mathbf{r})$ denotes a property density. For example, if $f(\mathbf{r})$ equals the electron density $\rho(\mathbf{r})$ then the corresponding atomic property is the electronic population of the topological atom,

$$N(\Omega_i) = \int_{\Omega_i} \rho(\mathbf{r})\, d\tau . \tag{2.2.9}$$

If $f(\mathbf{r}) = 1$ then we obtain the atomic volume,

$$v(\Omega_i) = \int_{\Omega_i} d\tau , \tag{2.2.10}$$

and when $f(\mathbf{r}) = \rho(\mathbf{r}) R_{\ell m}(\mathbf{r})$ we obtain the topological atom's multipole moments[76], where $R_{\ell m}(\mathbf{r})$ is a spherical tensor[77] of rank $\ell$ and component $m$. Others have shown[78, 79] that topological multipole moments agree better with reference electrostatic potentials than CHELPG[80] charges do. A further advantage of AIM is that the finite size and non-overlapping nature of the topological atoms avoids the penetration effect, which may otherwise appear in the calculation of intermolecular interaction energies.

We wish to clarify the nomenclature that we will adopt throughout this work. Bader has been insistent that the topological atoms as found through an AIM partitioning are open quantum mechanical systems, i.e. quantum atoms. In other words, an AIM partitioning defines an unequivocal quantum atom, i.e. ΩA is a topological atom if and only if ΩA is a quantum atom.

In computing the kinetic energy of an atom, one requires the integral of a kinetic energy density over the corresponding atomic domain. It can be shown that the addition of an arbitrary constant to the kinetic energy density recovers a unique kinetic energy[81] upon integration over the domain of a quantum atom[82]. However, one can show that there exist other spatial domains, that are not topological atoms, which also produce unique kinetic energies when the kinetic energy density is integrated over the domain[83]. Therefore, if $\Omega_A$ is a topological atom, it is a quantum atom, but the reverse does not hold.

We are unconcerned with this one-way implication; the fact that topological atoms are not the only atoms with a unique kinetic energy is unimportant. We are purely interested in the fact that topological atoms partition a molecular system into well-defined atomic domains, over which one can integrate and obtain atomic expectation values[84]. As such, we decouple our usage of the theory of AIM from the claim that AIM provides the atoms of chemistry by referring to the partitioning as Quantum Chemical Topology (QCT).

2.2.3 Interacting Quantum Atoms

Energetic partitioning is a commonly invoked methodology in quantum chemistry. Being able to obtain individual energetic components can prove to be highly informative, and can be used to correlate chemical intuition with quantum mechanical processes. Two popular energetic partitioning schemes are attributable to Morokuma[85], and Ziegler and Rauk[86]. We concern ourselves with an energy partitioning scheme, “Interacting Quantum Atoms”, which utilises the atomic partitioning outlined in Section 2.2.2 and is gaining in popularity[87].

Strictly speaking, we are interested in partitioning the Hamiltonian of a molecular system, $H$, into atomic contributions,

$$H = \sum_A \left[ T^A + V^{AA} + \sum_{B \neq A} V^{AB} \right] , \tag{2.2.11}$$

where $T^A$ is an atomic kinetic energy, $V^{AA}$ is an intra-atomic potential energy and $V^{AB}$ is an inter-atomic potential energy. The potential energy terms can be further decomposed by qualitative inspection. $V^{AA}$ is composed of electronic repulsion and electron-nuclear attraction operators, $V_{ee}^{AA}$ and $V_{en}^{AA}$, respectively, where the electrons and nuclei “belong to” atom $A$,

$$V^{AA} = V_{ee}^{AA} + V_{en}^{AA} . \tag{2.2.12}$$

$V^{AB}$ similarly comprises the electronic repulsion and electron-nuclear attraction energies, in addition to a nuclear-nuclear repulsion energy, $V_{nn}^{AB}$, but interactions are between particles in different atoms, i.e.

$$V^{AB} = V_{ee}^{AB} + V_{en}^{AB} + V_{ne}^{AB} + V_{nn}^{AB} , \tag{2.2.13}$$

allowing us to rewrite (2.2.11)

$$H = \sum_A \left[ T^A + V_{ee}^{AA} + V_{en}^{AA} + \sum_{B \neq A} \left( V_{ee}^{AB} + V_{en}^{AB} + V_{ne}^{AB} + V_{nn}^{AB} \right) \right] . \tag{2.2.14}$$

Note that the nuclear-nuclear repulsion energy can be immediately given by the classical electrostatic energy between point charges

$$V_{nn}^{AB} = \frac{Z_A Z_B}{r_{AB}} , \tag{2.2.15}$$

where $Z_A, Z_B$ are the nuclear charges associated with atoms $A$ and $B$, and $r_{AB}$ is the interatomic distance. A note is in order on the nomenclature we have adopted when denoting the potential energy terms. The subscript and superscript are in correspondence with one another, so that $V_{en}^{AB}$ denotes the potential energy operator for an electron in atom $A$ interacting with the nucleus of atom $B$. For this reason, we see that $V_{en}^{AB} \neq V_{ne}^{AB}$, but $V_{en}^{AB} = V_{ne}^{BA}$.
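Equation (2.2.15) is the one term of the partitioning that requires no integration, and can be evaluated directly from the nuclear coordinates. The sketch below assumes atomic units (charges in units of $e$, distances in bohr, energies in hartree); the function name and the geometry in the usage line are illustrative only.

```python
import math

def nuclear_repulsion(z_a, z_b, pos_a, pos_b):
    """Point-charge repulsion Z_A * Z_B / r_AB of (2.2.15), in atomic units."""
    r_ab = math.dist(pos_a, pos_b)
    return z_a * z_b / r_ab

# Two protons separated by 1.4 bohr along the z axis (illustrative geometry).
print(nuclear_repulsion(1, 1, (0.0, 0.0, 0.0), (0.0, 0.0, 1.4)))
```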

We now detail the methods required to accomplish the above partitioning for the electronic terms.

Density Matrices

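For a system containing $N$ electrons, we introduce the concept of the “$p$th order reduced density matrix” ($p$-RDM)$^8$, $\Gamma_p(\mathbf{r}_1, \dots, \mathbf{r}_p; \mathbf{r}'_1, \dots, \mathbf{r}'_p)$,

$$\Gamma_p(\mathbf{r}_1, \dots, \mathbf{r}_p; \mathbf{r}'_1, \dots, \mathbf{r}'_p) = \binom{N}{p} \int d\mathbf{r}_{p+1} \dots \int d\mathbf{r}_N\, \Psi^*(\mathbf{r}'_1, \dots, \mathbf{r}'_p, \mathbf{r}_{p+1}, \dots, \mathbf{r}_N) \Psi(\mathbf{r}_1, \dots, \mathbf{r}_p, \mathbf{r}_{p+1}, \dots, \mathbf{r}_N) . \tag{2.2.16}$$

To familiarise ourselves with the $p$-RDM, we take the special case of the 1-RDM, $\Gamma_1(\mathbf{r}_1; \mathbf{r}'_1)$,

$$\Gamma_1(\mathbf{r}_1; \mathbf{r}'_1) = N \int d\mathbf{r}_2 \dots \int d\mathbf{r}_N\, \Psi^*(\mathbf{r}'_1, \dots, \mathbf{r}_N) \Psi(\mathbf{r}_1, \dots, \mathbf{r}_N) . \tag{2.2.17}$$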

Integration over the position vector of every electron apart from one removes the dependence of the 1-RDM on the electrons that have been “integrated out”, and we are left with the coherence between a single electron positioned at $\mathbf{r}$ and $\mathbf{r}'$.

Take the one-electron operator, $\hat{O}_1 = \sum_{i=1}^{N} \hat{o}_i(\mathbf{r}_i)$, the expectation of which is given by

$$\langle O_1 \rangle = \sum_{i=1}^{N} \int d\mathbf{r}_1 \dots \int d\mathbf{r}_N\, \Psi^*(\mathbf{r}_1, \dots, \mathbf{r}_N)\, \hat{o}_i\, \Psi(\mathbf{r}_1, \dots, \mathbf{r}_N) . \tag{2.2.18}$$

Since fermions are indistinguishable, it can be shown that each term in the summation is equivalent, allowing us to write

$$\langle O_1 \rangle = N \int d\mathbf{r}_1 \dots \int d\mathbf{r}_N\, \Psi^*(\mathbf{r}_1, \dots, \mathbf{r}_N)\, \hat{o}_1\, \Psi(\mathbf{r}_1, \dots, \mathbf{r}_N) , \tag{2.2.19}$$

where the choice of $\hat{o}_1$ is arbitrary. We can reformulate this in terms of the 1-RDM by

$$\langle O_1 \rangle = \int d\mathbf{r}_1 \left[ \hat{o}_1 \Gamma_1(\mathbf{r}_1; \mathbf{r}'_1) \right] , \tag{2.2.20}$$

where we see that the primed coordinate prevents the operator $\hat{o}_1$ from operating on the complex conjugate part of the 1-RDM. Once the operator acts upon $\Psi(\mathbf{r}_1, \dots, \mathbf{r}_N)$, we allow $\mathbf{r}'_1 = \mathbf{r}_1$, and we recover (2.2.18). Thus, the primed notation tacitly implies that $\mathbf{r}_1$ and $\mathbf{r}'_1$ are distinct until the operator has acted on the wavefunction, at which point they become equivalent and integration can be performed.

$^8$ Formally, we only concern ourselves with the spatial $p$-RDM, where we make no reference to the spin degree of freedom of each electron. However, one can easily extend the following to account for spin by simply appending the spin degree of freedom to each electronic state vector, $\mathbf{x} = \mathbf{r} \wedge \sigma$.
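A discrete analogue may help to fix ideas. In the sketch below the continuous coordinate is replaced by a small one-dimensional grid, integrals become sums, and a normalised two-electron amplitude is built by antisymmetrising two orthonormal grid functions. All of these are our own modelling assumptions. The 1-RDM is formed as in (2.2.17), its trace returns the number of electrons, and the expectation of a local one-electron operator computed from its diagonal agrees with the direct sum of (2.2.18).

```python
import math

M = 8  # number of grid points (arbitrary)

def sine_orbital(k, m=M):
    """k-th discrete sine function on m points; distinct k are orthonormal."""
    norm = math.sqrt(2.0 / (m + 1))
    return [norm * math.sin(math.pi * k * j / (m + 1)) for j in range(1, m + 1)]

a, b = sine_orbital(1), sine_orbital(2)

# Antisymmetrised, normalised two-electron amplitude Psi[x1][x2] (real-valued).
psi = [[(a[i] * b[j] - b[i] * a[j]) / math.sqrt(2.0) for j in range(M)]
       for i in range(M)]

def one_rdm(psi, n_electrons=2):
    """gamma[x][x'] = N * sum_{x2} Psi(x', x2) * Psi(x, x2), cf. (2.2.17)."""
    m = len(psi)
    return [[n_electrons * sum(psi[xp][x2] * psi[x][x2] for x2 in range(m))
             for xp in range(m)] for x in range(m)]

gamma = one_rdm(psi)
trace = sum(gamma[x][x] for x in range(M))

# A local one-electron operator o(x); here simply the grid index (hypothetical).
o = [float(x) for x in range(M)]
via_rdm = sum(o[x] * gamma[x][x] for x in range(M))
direct = sum(psi[i][j] ** 2 * (o[i] + o[j]) for i in range(M) for j in range(M))
print(trace, via_rdm, direct)
```

For a local operator only the diagonal of the 1-RDM is needed; the off-diagonal elements matter for operators containing derivatives, such as the kinetic energy.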

We can extend this approach for two-electron operators, $\hat{O}_2$, where we introduce the 2-RDM

$$\Gamma_2(\mathbf{r}_1, \mathbf{r}_2; \mathbf{r}'_1, \mathbf{r}'_2) = \frac{N(N-1)}{2} \int d\mathbf{r}_3 \dots \int d\mathbf{r}_N\, \Psi^*(\mathbf{r}'_1, \mathbf{r}'_2, \dots, \mathbf{r}_N) \Psi(\mathbf{r}_1, \mathbf{r}_2, \dots, \mathbf{r}_N) . \tag{2.2.21}$$

Then,

$$\langle O_2 \rangle = \int d\mathbf{r}_1 \int d\mathbf{r}_2 \left[ \hat{o}_{12} \Gamma_2(\mathbf{r}_1, \mathbf{r}_2; \mathbf{r}'_1, \mathbf{r}'_2) \right] , \tag{2.2.22}$$

where $\hat{o}_{12}$ is a generic two-electron operator. We have adopted the convention of Löwdin[88], where a factor of a half is included in the second order density matrix to account for only the unique pairs of electrons. We note that this is in keeping with the combinatorial term in (2.2.16). McWeeny[89], for a not immediately obvious reason, omits this factor, which is seemingly erroneous, but is followed by Blanco et al.[90] in the original work on IQA. We will follow Löwdin's original definition and maintain consistency with (2.2.16).

One-Electron Potential

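We introduce a Heaviside function, $\Theta_\Omega(\mathbf{r})$, which is defined as being equal to unity if $\mathbf{r}$ lies within the atomic basin $\Omega$, and is equal to zero otherwise. Since a QCT partitioning is spatially exhaustive, the 1-RDM can be partitioned into atomic contributions by use of this Heaviside function,

$$\Gamma_1(\mathbf{r}_1; \mathbf{r}'_1) = \sum_A \Gamma_1^A(\mathbf{r}_1; \mathbf{r}'_1) = \sum_A \Gamma_1(\mathbf{r}_1; \mathbf{r}'_1) \Theta_A(\mathbf{r}'_1) . \tag{2.2.23}$$

We observe that such a partitioning is also valid for single-electron operator quantities, since

$$\begin{aligned}
\langle O_1 \rangle &= \int d\mathbf{r}_1 \left[ \hat{o}_1 \Gamma_1(\mathbf{r}_1; \mathbf{r}'_1) \right] = \int d\mathbf{r}_1 \left[ \hat{o}_1 \sum_A \Gamma_1^A(\mathbf{r}_1; \mathbf{r}'_1) \right] \\
&= \sum_A \int d\mathbf{r}_1\, \Theta_A(\mathbf{r}'_1)\, \hat{o}_1 \Gamma_1(\mathbf{r}_1; \mathbf{r}'_1) = \sum_A \int_{\Omega_A} d\mathbf{r}_1\, \hat{o}_1 \Gamma_1(\mathbf{r}_1; \mathbf{r}'_1) = \sum_A \langle O_1^A \rangle ,
\end{aligned} \tag{2.2.24}$$

where $\langle O_1^A \rangle$ is the expectation of the one-electron operator $O_1$ associated with the atom $A$.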
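The exhaustive, non-overlapping character of the partitioning can be illustrated with a one-dimensional toy model. The sketch below assumes a hypothetical density made of two Gaussians and two basins that meet at $x = 0$; in a genuine QCT calculation the boundary would be located from the zero-flux condition rather than fixed by hand. The indicator function plays the role of $\Theta_A$, and the basin populations sum to the total population.

```python
import math

def density(x):
    """Hypothetical one-dimensional model density: two Gaussian 'atoms'."""
    return math.exp(-(x + 1.0) ** 2) + 2.0 * math.exp(-(x - 1.0) ** 2)

def theta(x, basin):
    """Heaviside-style indicator: 1 inside the basin [lo, hi), 0 outside."""
    lo, hi = basin
    return 1.0 if lo <= x < hi else 0.0

# Basin boundaries chosen by hand for illustration (assumption).
basins = {"A": (-10.0, 0.0), "B": (0.0, 10.0)}

dx = 0.001
grid = [-10.0 + (i + 0.5) * dx for i in range(20000)]

populations = {
    name: sum(theta(x, basin) * density(x) * dx for x in grid)
    for name, basin in basins.items()
}
total = sum(density(x) * dx for x in grid)
print(populations, total)
```

Unlike the sharing functions of Section 2.2.1, each grid point contributes to exactly one basin.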

The kinetic energy can be written

$$T^A = -\frac{\hbar^2}{2m} \int_{\Omega_A} d\mathbf{r}_1\, \nabla_1^2 \Gamma_1(\mathbf{r}_1; \mathbf{r}'_1) , \tag{2.2.25}$$

where $m$ is the mass of an electron and $\nabla_1^2$ is the Laplacian with respect to $\mathbf{r}_1$.

The electron-nuclear attraction can be written

$$V_{en}^{AB} = -\int_{\Omega_A} d\mathbf{r}_1\, \frac{\rho(\mathbf{r}_1) Z_B}{r_{en}} , \tag{2.2.26}$$

where the 1-RDM is reduced to the electron density, $\rho(\mathbf{r}_1)$, since $V_{en}^{AB}$ does not alter the wavefunction it operates on. By permuting $A$ and $B$, we obtain an equivalent expression for $V_{ne}^{AB}$. By taking the case where $A = B$, we also recover the expression for $V_{en}^{AA}$.

Two-Electron Potential

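The operator $V_{ee}^{AB}$ requires the 2-RDM, which can be split into three distinct components:

$$\Gamma_2(\mathbf{r}_1, \mathbf{r}_2; \mathbf{r}'_1, \mathbf{r}'_2) = \rho(\mathbf{r}_1)\rho(\mathbf{r}_2) - \Gamma_1(\mathbf{r}_1; \mathbf{r}_2)\Gamma_1(\mathbf{r}_2; \mathbf{r}_1) + \Gamma_2^{\text{corr}}(\mathbf{r}_1, \mathbf{r}_2; \mathbf{r}'_1, \mathbf{r}'_2) , \tag{2.2.27}$$

where the first term refers to the uncorrelated Coulomb density, the second term to the quantum mechanical electron exchange, and the third term to the electron correlation. Then,

$$\begin{aligned}
V_{ee}^{AB} &= \int_{\Omega_A} d\mathbf{r}_1 \int_{\Omega_B} d\mathbf{r}_2\, \frac{\Gamma_2(\mathbf{r}_1, \mathbf{r}_2; \mathbf{r}'_1, \mathbf{r}'_2)}{r_{12}} \\
&= \int_{\Omega_A} d\mathbf{r}_1 \int_{\Omega_B} d\mathbf{r}_2\, \frac{\rho(\mathbf{r}_1)\rho(\mathbf{r}_2)}{r_{12}} - \int_{\Omega_A} d\mathbf{r}_1 \int_{\Omega_B} d\mathbf{r}_2\, \frac{\Gamma_1(\mathbf{r}_1; \mathbf{r}_2)\Gamma_1(\mathbf{r}_2; \mathbf{r}_1)}{r_{12}} \\
&\quad + \int_{\Omega_A} d\mathbf{r}_1 \int_{\Omega_B} d\mathbf{r}_2\, \frac{\Gamma_2^{\text{corr}}(\mathbf{r}_1, \mathbf{r}_2; \mathbf{r}'_1, \mathbf{r}'_2)}{r_{12}} \\
&= V_{ee,\text{coul}}^{AB} + V_{ee,\text{exch}}^{AB} + V_{ee,\text{corr}}^{AB} ,
\end{aligned} \tag{2.2.28}$$

where $V_{ee,\text{coul}}^{AB}$ is the Coulomb energy (which we deal with in Section 2.3.2), $V_{ee,\text{exch}}^{AB}$ is the Fock-Dirac exchange energy, and $V_{ee,\text{corr}}^{AB}$ is the electron correlation energy. Note that by taking the case where $A = B$, we obtain the expression for $V_{ee}^{AA}$. We have then accounted for all of the terms in (2.2.14), and illustrated a complete energy partitioning through IQA.

2.3 Multipole Moment Electrostatics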

2.3.1 Mathematical Details

The Multipole Moments

The Poisson equation in three spatial dimensions is given by

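$$\nabla^2 \phi(\mathbf{r}) = -\frac{\rho(\mathbf{r})}{\epsilon_0} , \tag{2.3.1}$$

where $\phi(\mathbf{r})$ is the electric potential at $\mathbf{r}$, $\epsilon_0$ is the permittivity of free space and $\rho(\mathbf{r})$ is a charge distribution. We assume the system is in some arbitrary coordinate frame. The Green's function, $G(\mathbf{r}, \mathbf{r}')$, for the Poisson equation is defined as

$$\nabla^2 G(\mathbf{r}, \mathbf{r}') = \delta(\mathbf{r} - \mathbf{r}') , \tag{2.3.2}$$

where the Dirac delta function, $\delta(\mathbf{r} - \mathbf{r}')$, has been introduced. By analysis, and a little foresight, we see that $\mathbf{r}'$ corresponds to a position at which the electric potential diverges, i.e. a charge, as can be verified from Coulomb's law. By definition, $\phi(\mathbf{r})$ is then the convolution of the Green's function and the inhomogeneous term for the Poisson equation, i.e.

$$\phi(\mathbf{r}) = -G(\mathbf{r}, \mathbf{r}') * \frac{\rho(\mathbf{r})}{\epsilon_0} = -\frac{1}{\epsilon_0} \int_{\mathbb{R}^3} G(\mathbf{r}, \mathbf{r}') \rho(\mathbf{r}')\, d\mathbf{r}' . \tag{2.3.3}$$

The Green's function which solves (2.3.2) is the Newton kernel,

$$G(\mathbf{r}, \mathbf{r}') = -\frac{1}{4\pi ||\mathbf{r} - \mathbf{r}'||} , \tag{2.3.4}$$

leading us to the solution of (2.3.3)

$$\phi(\mathbf{r}) = \frac{1}{4\pi\epsilon_0} \int_{\mathbb{R}^3} \frac{\rho(\mathbf{r}')}{||\mathbf{r} - \mathbf{r}'||}\, d\mathbf{r}' . \tag{2.3.5}$$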
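That the Newton kernel of (2.3.4) satisfies (2.3.2) away from the source can be checked numerically. The sketch below places the source at the origin and evaluates a second-order central-difference Laplacian at a point elsewhere, where the right-hand side of (2.3.2) vanishes. The step size, the test point and the function names are arbitrary choices of ours; the check says nothing about the behaviour at the singularity itself.

```python
import math

def green(x, y, z):
    """Newton kernel -1 / (4 * pi * r) for a source at the origin."""
    return -1.0 / (4.0 * math.pi * math.sqrt(x * x + y * y + z * z))

def laplacian(f, x, y, z, h=1e-3):
    """Second-order central-difference estimate of the Laplacian of f."""
    centre = f(x, y, z)
    neighbours = (
        f(x + h, y, z) + f(x - h, y, z)
        + f(x, y + h, z) + f(x, y - h, z)
        + f(x, y, z + h) + f(x, y, z - h)
    )
    return (neighbours - 6.0 * centre) / (h * h)

# Away from the origin the Laplacian should vanish to within discretisation error.
print(laplacian(green, 0.7, -0.4, 1.1))
```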

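We rewrite the factor of $||\mathbf{r} - \mathbf{r}'||^{-1}$ by use of a well-known formula from the theory of spherical harmonics[91]

$$\begin{aligned}
||\mathbf{r} - \mathbf{r}'||^{-1} &= \left( r^2 + (r')^2 - 2rr' \cos\chi \right)^{-1/2} \\
&= \frac{1}{r} \left( 1 + \left( \frac{r'}{r} \right)^2 - 2 \left( \frac{r'}{r} \right) \cos\chi \right)^{-1/2} \\
&= \frac{1}{r} \sum_{\ell=0}^{\infty} \left( \frac{r'}{r} \right)^{\ell} P_\ell(\cos\chi) = \sum_{\ell=0}^{\infty} \frac{(r')^{\ell}}{r^{\ell+1}} P_\ell(\cos\chi) ,
\end{aligned} \tag{2.3.6}$$

where we have used $r = ||\mathbf{r}||$, $r' = ||\mathbf{r}'||$, $\chi$ is the angle between $\mathbf{r}$ and $\mathbf{r}'$, and $P_\ell(x) = \frac{1}{2^\ell \ell!} \frac{d^\ell}{dx^\ell} (x^2 - 1)^\ell$ is the $\ell$th order Legendre polynomial. The series converges for $r' < r$. Now, consider the spherical angles $\{\Theta, \Phi\}$ and $\{\theta, \phi\}$ that $\mathbf{r}$ and $\mathbf{r}'$ make with some fixed Cartesian frame of reference. The addition theorem of spherical harmonics expresses the Legendre polynomial in terms of these spherical angles as

$$P_\ell(\cos\chi) = \sum_{m=-\ell}^{\ell} \frac{(\ell - |m|)!}{(\ell + |m|)!} P_\ell^{|m|}(\cos\Theta) P_\ell^{|m|}(\cos\theta) e^{-im(\Phi - \phi)} . \tag{2.3.7}$$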
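The Legendre expansion of (2.3.6) can be verified numerically by truncating the sum. The sketch below generates the Legendre polynomials with the standard three-term (Bonnet) recursion and compares the truncated series with the exact inverse distance. The truncation order, the sample geometry and the function names are our own choices.

```python
import math

def legendre(l_max, x):
    """Return [P_0(x), ..., P_lmax(x)] from Bonnet's recursion."""
    p = [1.0, x]
    for l in range(1, l_max):
        p.append(((2 * l + 1) * x * p[l] - l * p[l - 1]) / (l + 1))
    return p[: l_max + 1]

def inverse_distance_series(r, r_prime, cos_chi, l_max):
    """Truncated form of (2.3.6); requires r_prime < r to converge."""
    p = legendre(l_max, cos_chi)
    return sum(r_prime ** l / r ** (l + 1) * p[l] for l in range(l_max + 1))

def inverse_distance_exact(r, r_prime, cos_chi):
    return 1.0 / math.sqrt(r * r + r_prime * r_prime - 2.0 * r * r_prime * cos_chi)

r, r_prime, cos_chi = 2.0, 0.5, 0.3
for l_max in (0, 2, 5, 20):
    approx = inverse_distance_series(r, r_prime, cos_chi, l_max)
    print(l_max, approx, inverse_distance_exact(r, r_prime, cos_chi))
```

The error falls off roughly as $(r'/r)^{\ell_{\max}+1}$, which is why the expansion is most useful when the field point is far from the charge distribution.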

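Further, introducing the spherical harmonics of rank $\ell, m$,

$$\begin{aligned}
Y_{\ell m}(\theta, \phi) &= (-1)^m \sqrt{\frac{2\ell + 1}{4\pi} \frac{(\ell - |m|)!}{(\ell + |m|)!}}\, P_\ell^m(\cos\theta) e^{im\phi} \quad (m \geq 0) \\
Y_{\ell,-|m|}(\theta, \phi) &= (-1)^{\ell - m}\, Y^*_{\ell,|m|}(\theta, \phi) \quad (m < 0) ,
\end{aligned} \tag{2.3.8}$$

(2.3.6) can be rewritten to read

$$||\mathbf{r} - \mathbf{r}'||^{-1} = \sum_{\ell=0}^{\infty} \sum_{m=-\ell}^{\ell} \frac{(r')^{\ell}}{r^{\ell+1}} \frac{4\pi}{2\ell + 1} Y_{\ell,-|m|}(\Theta, \Phi) Y_{\ell,m}(\theta, \phi) , \tag{2.3.9}$$

allowing us to express the electric potential of (2.3.5) as

$$\begin{aligned}
\phi(\mathbf{r}) &= \frac{1}{4\pi\epsilon_0} \sum_{\ell=0}^{\infty} \sum_{m=-\ell}^{\ell} \int_{\mathbb{R}^3} d\mathbf{r}'\, \rho(\mathbf{r}') \frac{(r')^{\ell}}{r^{\ell+1}} \frac{4\pi}{2\ell + 1} Y_{\ell,-|m|}(\Theta, \Phi) Y_{\ell,m}(\theta, \phi) \\
&= \frac{1}{4\pi\epsilon_0} \sum_{\ell=0}^{\infty} \sum_{m=-\ell}^{\ell} \frac{Q_{\ell,m}}{r^{\ell+1}} \sqrt{\frac{4\pi}{2\ell + 1}}\, Y_{\ell,-|m|}(\Theta, \Phi) .
\end{aligned} \tag{2.3.10}$$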

Figure 2.2: Vectors r and r0 in a Cartesian axis system, along with their colatitu- dinal and azimuthal degrees of freedom, (Θ, Φ) and (θ, φ), respectively. In this final expression, we have introduced the multipole moment of rank (`, m), Z r 0 0 0 ` 4π Q`,m = dr ρ(r )(r ) Y`,m(θ, φ) . (2.3.11) R3 2` + 1 When ` = 0 (and m = 0 by extension), it is easily shown, with the knowledge that

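$Y_{0,0}(\theta,\phi) = \frac{1}{2}\sqrt{\frac{1}{\pi}}$, that the electric potential becomes

\[
\phi(\mathbf{r}) = \frac{1}{4\pi\epsilon_0 r} \int_{\mathbb{R}^3} d\mathbf{r}' \rho(\mathbf{r}') , \tag{2.3.12}
\]

which is Coulomb’s Law, describing a spherically symmetric potential about r. Also, with the three spherical harmonics

\[
\begin{aligned}
Y_{1,0}(\theta,\phi) &= \frac{1}{2}\sqrt{\frac{3}{\pi}} \cos\theta \\
Y_{1,1}(\theta,\phi) &= -\frac{1}{2}\sqrt{\frac{3}{2\pi}} e^{i\phi} \sin\theta \\
Y_{1,-1}(\theta,\phi) &= \frac{1}{2}\sqrt{\frac{3}{2\pi}} e^{-i\phi} \sin\theta ,
\end{aligned} \tag{2.3.13}
\]

we find that the multipole moments of rank ℓ = 1 take the form

\begin{align}
Q_{1,0} &= \int_{\mathbb{R}^3} d\mathbf{r}' \rho(\mathbf{r}') r' \cos\theta = \mu_z \tag{2.3.14} \\
Q_{1,1} &= -\frac{1}{\sqrt{2}} \int_{\mathbb{R}^3} d\mathbf{r}' \rho(\mathbf{r}') r' \left( \sin\theta\cos\phi + i \sin\phi\sin\theta \right) = -\frac{1}{\sqrt{2}} \left( \mu_x + i\mu_y \right) \tag{2.3.15} \\
Q_{1,-1} &= \frac{1}{\sqrt{2}} \int_{\mathbb{R}^3} d\mathbf{r}' \rho(\mathbf{r}') r' \left( \sin\theta\cos\phi - i \sin\phi\sin\theta \right) = \frac{1}{\sqrt{2}} \left( \mu_x - i\mu_y \right) , \tag{2.3.16}
\end{align}

where we have introduced components of the dipole moment vector, µ,

\[
\mu_\alpha = \left( \frac{\mathbf{r}'}{r'} \cdot \hat{\boldsymbol{\alpha}} \right) \int_{\mathbb{R}^3} d\mathbf{r}' \rho(\mathbf{r}') r' \qquad (\alpha = x, y, z) . \tag{2.3.17}
\]

In the above, $\hat{\boldsymbol{\alpha}}$ is a unit vector along the Cartesian axis along which $\mu_\alpha$ is oriented. Higher order multipole moments are obtained in a similar way.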
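The low-rank terms of this expansion can be checked numerically. The following is a minimal sketch, assuming the continuous density ρ(r′) is replaced by a handful of point charges and that atomic units are used so that 1/4πε0 = 1. The charges, positions and function names are hypothetical and chosen purely for illustration.

```python
import numpy as np

def spherical_moments(charges, positions):
    """Rank 0 and rank 1 moments of (2.3.11) for point charges (atomic units)."""
    q = np.asarray(charges, dtype=float)
    x, y, z = np.asarray(positions, dtype=float).T
    # With x = r sin(theta) cos(phi), y = r sin(theta) sin(phi), z = r cos(theta),
    # r * sqrt(4 pi / 3) * Y_{1,m} reduces to the Cartesian combinations below.
    return {
        (0, 0): q.sum(),
        (1, 0): np.sum(q * z),
        (1, 1): -np.sum(q * (x + 1j * y)) / np.sqrt(2.0),
        (1, -1): np.sum(q * (x - 1j * y)) / np.sqrt(2.0),
    }

def potential_rank1(charges, positions, r):
    """Potential at r from the monopole and dipole terms only."""
    Q = spherical_moments(charges, positions)
    # Recover the Cartesian dipole from the spherical components.
    mu = np.array([
        -np.sqrt(2.0) * Q[(1, 1)].real,  # mu_x
        -np.sqrt(2.0) * Q[(1, 1)].imag,  # mu_y
        Q[(1, 0)],                       # mu_z
    ])
    r = np.asarray(r, dtype=float)
    dist = np.linalg.norm(r)
    return Q[(0, 0)] / dist + np.dot(mu, r) / dist**3

def potential_exact(charges, positions, r):
    """Direct Coulomb sum over the point charges, for comparison."""
    d = np.linalg.norm(np.asarray(r, dtype=float) - np.asarray(positions, dtype=float), axis=1)
    return float(np.sum(np.asarray(charges, dtype=float) / d))

# Hypothetical two-charge system and a distant evaluation point.
charges = [1.0, -0.5]
positions = [[0.0, 0.0, 0.5], [0.3, 0.0, -0.5]]
far_point = [30.0, 40.0, 0.0]
approx = potential_rank1(charges, positions, far_point)
exact = potential_exact(charges, positions, far_point)
```

Far from the charges, the truncated and exact potentials should agree closely, with the residual dominated by the neglected quadrupole term.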

Interaction

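Consider the Coulomb interaction between two charge distributions, which requires the computation of an expensive six-dimensional integral

\[
E_{AB} = \sum_{A \neq B} \int_{\mathbb{R}^3} d\mathbf{r}_A \int_{\mathbb{R}^3} d\mathbf{r}_B \frac{\rho(\mathbf{r}_A)\rho(\mathbf{r}_B)}{r_{AB}} . \tag{2.3.18}
\]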

Figure 2.3: Vector quantities required for an expansion, in multipole moments, of the interaction between the charge distributions centred at $\mathbf{R}_A$ and $\mathbf{R}_B$.

The following derivation is taken from the extensive work of Stone[92] on the spherical tensor formulation for multipole moment interactions. The factor of $1/r_{AB}$ can be transformed into an expansion in Legendre polynomials, as undertaken in (2.3.6), by recognising that

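\[
\mathbf{r}_{AB} = (\mathbf{R}_B + \mathbf{r}_B) - (\mathbf{R}_A + \mathbf{r}_A) = \mathbf{R}_{AB} - (\mathbf{r}_A - \mathbf{r}_B) , \tag{2.3.19}
\]

where we have defined $\mathbf{R}_{AB} = \mathbf{R}_B - \mathbf{R}_A$, leading to the relation

\[
\frac{1}{\lVert \mathbf{R}_{AB} - (\mathbf{r}_A - \mathbf{r}_B) \rVert} = \sum_{\ell_A,\ell_B=0}^{\infty} \sum_{m_A=-\ell_A}^{\ell_A} \sum_{m_B=-\ell_B}^{\ell_B} T_{\ell_A,\ell_B,m_A,m_B} R_{\ell_B,m_B}(\mathbf{r}_B) R_{\ell_A,m_A}(\mathbf{r}_A) . \tag{2.3.20}
\]

We introduce the regular and irregular spherical harmonics,

\[
\begin{aligned}
R_{\ell,m}(\mathbf{r}) &= r^{\ell} Y_{\ell,m}(\theta,\phi) \\
I_{\ell,m}(\mathbf{r}) &= \frac{1}{r^{\ell+1}} Y_{\ell,m}(\theta,\phi)
\end{aligned} \tag{2.3.21}
\]

respectively, as well as the geometric “interaction tensor”, $T_{\ell_A,\ell_B,m_A,m_B}$, which is an expression for the mutual orientation of the local axis systems of A and B, in addition to their distance from their origins, $R_{AB}$, and takes the form

\[
T_{\ell_A,\ell_B,m_A,m_B} = (-1)^{\ell_A} \sqrt{\frac{(2\ell_A + 2\ell_B + 1)!}{(2\ell_A)!(2\ell_B)!}} \times \begin{bmatrix} \ell_A & \ell_B & \ell_A + \ell_B \\ m_A & m_B & -(m_A + m_B) \end{bmatrix} I_{\ell_A+\ell_B,-(m_A+m_B)}(\mathbf{R}_{AB}) . \tag{2.3.22}
\]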

The bracketed term in the above expression is a Wigner-3j symbol[93]. The interaction tensor for higher order angular momenta can be formulated recursively, which simplifies any resultant computation[94]. The total Coulomb interaction between two charge distributions is therefore given in its entirety by

\[
E_{AB} = \sum_{A \neq B} \sum_{\ell_A,\ell_B=0}^{\infty} \sum_{m_A=-\ell_A}^{\ell_A} \sum_{m_B=-\ell_B}^{\ell_B} \int_{\mathbb{R}^3} d\mathbf{r}_A \int_{\mathbb{R}^3} d\mathbf{r}_B \, T_{\ell_A,\ell_B,m_A,m_B} \rho(\mathbf{r}_B) R_{\ell_B,m_B}(\mathbf{r}_B) \rho(\mathbf{r}_A) R_{\ell_A,m_A}(\mathbf{r}_A) . \tag{2.3.23}
\]

Recalling the expression for multipole moments given in (2.3.11), we can rewrite this to be slightly less unwieldy

\[
E_{AB} = \sum_{A \neq B} \sum_{\ell_A,\ell_B=0}^{\infty} \sum_{m_A=-\ell_A}^{\ell_A} \sum_{m_B=-\ell_B}^{\ell_B} T_{\ell_A,\ell_B,m_A,m_B} Q_{\ell_B,m_B}(\mathbf{r}_B) Q_{\ell_A,m_A}(\mathbf{r}_A) . \tag{2.3.24}
\]

Two methodological concerns are, however, apparent: (1) infinite spatial integrations, as appear in the definition of multipole moments, (2.3.11), are not feasible, and; (2) neither are infinite summations over angular momenta, as appear in the definition of the Coulomb interaction of (2.3.24). To remedy the first issue of infinite spatial integrations, some form of partitioning scheme is required to localise multipole moment expansions over a finite domain. Real space partitioning is desirable, and a Hirshfeld partitioning has been shown to yield stable multipole moment expansions[95]. In our work, we perform integrations over atomic basins, such that multipole moments are defined by[74]

\[
Q^{A}_{\ell,m} = \int_{\Omega_A} d\mathbf{r}_A \rho(\mathbf{r}_A) r_A^{\ell} \sqrt{\frac{4\pi}{2\ell + 1}} Y_{\ell,m}(\theta,\phi) , \tag{2.3.25}
\]

where $Q^{A}_{\ell,m}$ denotes a multipole moment associated with the atomic basin, $\Omega_A$. Whilst having to explicitly compute the surfaces of the atomic basin is a computationally intensive process, we have adopted it for the reasons outlined in Section 2.2.2.

The second issue of infinite expansions over angular momenta is easily solved by approximating $E_{AB}$ with a truncated expansion to finite $\ell_A$, $\ell_B$. Past work has shown that, by introducing the interaction rank, $L = \ell_A + \ell_B + 1$, $E_{AB}$ converges for values of L = 5[96, 97], i.e. up to quadrupole-quadrupole, and all rank-equivalent combinations, such as octupole-dipole and hexadecupole-monopole interactions.
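The bookkeeping implied by the interaction rank can be sketched in a few lines. The snippet below enumerates the $(\ell_A, \ell_B)$ pairs retained by a truncation at a given L; the function names are hypothetical, and the term count simply multiplies the (2ℓ + 1) components of each moment.

```python
# Multipole names up to rank 4, using the spelling adopted in the text.
NAMES = {0: "monopole", 1: "dipole", 2: "quadrupole", 3: "octupole", 4: "hexadecupole"}

def pairs_at_rank(L):
    """All (l_A, l_B) with interaction rank l_A + l_B + 1 equal to L."""
    return [(l_a, L - 1 - l_a) for l_a in range(L)]

def pairs_up_to_rank(L_max):
    """All (l_A, l_B) retained when the expansion is truncated at L_max."""
    return [pair for L in range(1, L_max + 1) for pair in pairs_at_rank(L)]

def number_of_terms(L_max):
    """Number of (l_A, m_A, l_B, m_B) combinations in the truncated sum."""
    return sum((2 * l_a + 1) * (2 * l_b + 1) for l_a, l_b in pairs_up_to_rank(L_max))

# Print the rank-equivalent combinations at L = 5.
for l_a, l_b in pairs_at_rank(5):
    print(f"{NAMES[l_a]}-{NAMES[l_b]}")
```

For L = 5 this lists the quadrupole-quadrupole interaction together with its rank-equivalent partners, such as octupole-dipole and hexadecupole-monopole.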

A final point that requires clarification is that (2.3.24) involves an irregular spherical harmonic that is a function of $\mathbf{R}_{AB}$, and regular spherical harmonics that are functions of $\mathbf{r}_A$ and $\mathbf{r}_B$. One can identify that the interaction diverges unless $\lVert \mathbf{r}_A - \mathbf{r}_B \rVert < \lVert \mathbf{R}_{AB} \rVert$. This inequality is used in the evaluation of interactions between multipole moments; the nuclear separation must be greater than the distance over which multipole moment interactions are evaluated. We can quickly satisfy this condition by only evaluating 1-4 and higher electrostatic interactions, although this does not guarantee convergence of $E_{AB}$. In the case of divergence, one can shift the multipole moment centres to non-nuclear positions in an attempt to enforce convergence[96], and one is in fact able to obtain a value for the “optimal” shift to this end algorithmically[98]. The electrostatics of 1-2 and 1-3 interactions must be treated with a separate short-range electrostatic term, accounted for by an IQA treatment, as explained in Section 2.5.2. In classical force fields, however, these effects will be incorporated into the “bonded” terms, as has been described in Section 2.1.2.

Distributed Multipole Analysis

While not strictly relevant to the work undertaken in this thesis, Distributed Multipole Analysis (DMA)[92] has been the major impetus towards a rigorous treatment of molecular electrostatics. As such, we find that some discussion of DMA is not only interesting, but also warranted. In conventional ab initio treatments, one-electron wavefunctions (molecular orbitals) are expanded in a basis of atomic orbitals, centred on atomic nuclei⁹, $\chi_A(\mathbf{r} - \mathbf{R}_A)$,

\[
\psi(\mathbf{r}) = \sum_{A} c_A \chi_A(\mathbf{r} - \mathbf{R}_A) , \tag{2.3.26}
\]

where the summation over A denotes a summation over atomic centres and $c_A$ is an expansion coefficient. The atomic orbitals are commonly written as Gaussian functions

\[
\chi_A(\mathbf{r} - \mathbf{R}_A) = R_{\ell,m}(\mathbf{r} - \mathbf{R}_A) \exp\left[ -\zeta \lVert \mathbf{r} - \mathbf{R}_A \rVert^2 \right] , \tag{2.3.27}
\]

where ζ is a parameterised exponent. The electron density is subsequently expressed as

\[
\rho(\mathbf{r}) = \sum_{A,B} P_{AB} \chi_A(\mathbf{r} - \mathbf{R}_A) \chi_B(\mathbf{r} - \mathbf{R}_B) , \tag{2.3.28}
\]

where $P_{AB}$ is an element of the density matrix. The above expansion for the electron density is then composed of products of Gaussians,

\[
\chi_A(\mathbf{r} - \mathbf{R}_A) \chi_B(\mathbf{r} - \mathbf{R}_B) = R_{\ell,m}(\mathbf{r} - \mathbf{R}_A) \exp\left[ -\alpha \lVert \mathbf{r} - \mathbf{R}_A \rVert^2 \right] \times R_{\ell',m'}(\mathbf{r} - \mathbf{R}_B) \exp\left[ -\beta \lVert \mathbf{r} - \mathbf{R}_B \rVert^2 \right] , \tag{2.3.29}
\]

where α and β are the exponents for the respective Gaussians. By virtue of the Gaussian product theorem of Boys[100], we are able to write the product of Gaussians as a single Gaussian centred at a point, P, along the line $\mathbf{R}_{AB}$,

\[
\exp\left[ -\alpha \lVert \mathbf{r} - \mathbf{R}_A \rVert^2 \right] \exp\left[ -\beta \lVert \mathbf{r} - \mathbf{R}_B \rVert^2 \right] = \exp\left[ -\frac{\alpha\beta}{\alpha + \beta} \lVert \mathbf{R}_A - \mathbf{R}_B \rVert^2 \right] \exp\left[ -(\alpha + \beta) \lVert \mathbf{r} - \mathbf{P} \rVert^2 \right] , \tag{2.3.30}
\]

where $\mathbf{P} = (\alpha\mathbf{R}_A + \beta\mathbf{R}_B)/(\alpha + \beta)$, and is referred to as the “overlap centre”[101].

⁹ There is no technical constraint requiring that the basis functions are centred on atomic nuclei; one could use, for instance, floating spherical Gaussian orbitals[99]. However, we proceed by use of atom-centred basis functions, as the consequent nomenclature is then consistent with what has been introduced in the previous sections.

The regular spherical harmonic terms appearing in (2.3.29) must also be moved to P. The movement of $R_{\ell,m}(\mathbf{r} - \mathbf{R}_A)$, for instance, is accomplished by use of a linear combination of regular spherical harmonics up to rank ℓ,

\[
R_{\ell,m}(\mathbf{r} - \mathbf{R}_A) = \sum_{k=0}^{\ell} \sum_{q=-k}^{k} \left[ \binom{\ell + m}{k + q} \binom{\ell - m}{k - q} \right]^{1/2} R_{k,q}(\mathbf{r} - \mathbf{P}) R_{\ell-k,m-q}(\mathbf{P} - \mathbf{R}_A) . \tag{2.3.31}
\]

The same scheme is valid for $R_{\ell',m'}(\mathbf{r} - \mathbf{R}_B)$, yielding a linear combination of spherical harmonics up to rank ℓ′. In addition, the product of spherical harmonics of rank ℓ and ℓ′ in (2.3.29) is a linear combination of spherical harmonics, ranging from ranks |ℓ − ℓ′| to ℓ + ℓ′. For instance, if ℓ = ℓ′ = 1, (2.3.29) would contain a product of expansions containing monopole and dipole moments. The product of monopole moment terms would lead to a monopole moment term. The product of monopole and dipole moment terms would lead to dipole (|ℓ − ℓ′| = 1 and ℓ + ℓ′ = 1) moment terms only. Finally, the product of dipole moment terms would lead to monopole (|ℓ − ℓ′| = 0), dipole and quadrupole (ℓ + ℓ′ = 2) moment terms. One can then appreciate that (2.3.28), when fully expanded, becomes a rather lengthy linear combination of terms of the form $R_{k,q}(\mathbf{r} - \mathbf{P}) \exp\left[ -\zeta \lVert \mathbf{r} - \mathbf{P} \rVert^2 \right]$.
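The Gaussian product theorem of (2.3.30) is straightforward to verify numerically. The sketch below, with hypothetical exponents and centres, evaluates both sides of the identity for the s-type (exponential) factors only, at a set of random points; the handling of the regular spherical harmonic prefactors in (2.3.31) is deliberately left out.

```python
import numpy as np

def gaussian_product(alpha, R_A, beta, R_B):
    """Prefactor K, overlap centre P and combined exponent of (2.3.30)."""
    R_A = np.asarray(R_A, dtype=float)
    R_B = np.asarray(R_B, dtype=float)
    gamma = alpha + beta
    P = (alpha * R_A + beta * R_B) / gamma
    K = np.exp(-alpha * beta / gamma * np.sum((R_A - R_B) ** 2))
    return K, P, gamma

# Hypothetical exponents and centres, chosen only for illustration.
alpha, beta = 1.3, 0.4
R_A = np.array([0.0, 0.0, 0.0])
R_B = np.array([0.0, 0.0, 1.7])
K, P, gamma = gaussian_product(alpha, R_A, beta, R_B)

# Evaluate both sides of (2.3.30) at random points in a box.
points = np.random.default_rng(0).uniform(-3.0, 3.0, size=(50, 3))
lhs = (np.exp(-alpha * np.sum((points - R_A) ** 2, axis=1))
       * np.exp(-beta * np.sum((points - R_B) ** 2, axis=1)))
rhs = K * np.exp(-gamma * np.sum((points - P) ** 2, axis=1))
```

With these values the overlap centre lies on the line joining the two centres, weighted towards the Gaussian with the larger exponent.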

Now, if we are to evaluate the “overlap multipole moments” associated with the term $R_{k,q}(\mathbf{r} - \mathbf{P}) \exp\left[ -\zeta \lVert \mathbf{r} - \mathbf{P} \rVert^2 \right]$, we invoke (2.3.11),

\[
Q_{\ell,m} = \int_{\mathbb{R}^3} d\mathbf{r} \, R_{k,q}(\mathbf{r} - \mathbf{P}) \exp\left[ -\zeta \lVert \mathbf{r} - \mathbf{P} \rVert^2 \right] \lVert \mathbf{r} - \mathbf{P} \rVert^{\ell} \sqrt{\frac{4\pi}{2\ell + 1}} Y_{\ell,m}(\mathbf{r} - \mathbf{P}) , \tag{2.3.32}
\]

which is zero unless (ℓ, m) = (k, q), since the spherical harmonics are mutually orthogonal[92]. Then, (2.3.29) comprises overlap multipole moments of ranks ranging from 0 to ℓ + ℓ′. To clarify this, consider that the two Gaussians $\chi_A$ and $\chi_B$ of (2.3.29) have angular momenta ℓ = ℓ′ = 0. Then, the electron density corresponding to this product of Gaussians is represented by a monopole moment at the overlap centre. If ℓ = 0 and ℓ′ = 1, then a monopole and dipole moment are required at the overlap centre.

For every Gaussian product in the expansion (2.3.28), the overlap centre is typically different to other Gaussian products in the expansion. These different overlap centres arise because P is defined from the Gaussian exponents, which are typically distinct pairs for each Gaussian product. Therefore, a number of overlap multipole moments are defined at various overlap centres along RAB. The situation is best illustrated by the diagram given by Stone in the original DMA publication[101]. We reproduce this diagram in Figure 2.4, where the values of the overlap monopole and dipole moments for hydrogen fluoride (with a [6s5p3d/4s3p] basis set) are plotted along the internuclear axis at the corresponding overlap centres.

Having overlap multipole moments scattered between nuclei makes the above methodology computationally cumbersome. One can, however, shift the overlap multipole moments to one of the two atomic centres. This shifting is undertaken by a formula similar to that presented in (2.3.31). Denote the distance between the overlap centre and an atomic centre by s. Then, shifting an overlap multipole moment, $Q_{k,m}$, from P to the atomic centre requires that the atomic multipole moment be written as a linear combination of multipole moments, of ranks ℓ ≥ k. Each of these atomic multipole moments is also multiplied by a term proportional to $s^{\ell-k}$. As such, if s is small, the higher order multipole moments contribute negligibly to the molecular electrostatics, and the expansion can be truncated at low order. However, this multiplicative term of $s^{\ell-k}$ also constrains the magnitude of s, since the requirement of including higher order atomic multipole moments becomes burdensome.

Figure 2.4: The overlap monopole (upper graph) and dipole (lower graph) moments at various overlap centres along the hydrogen fluoride internuclear axis. The nuclear charges have been added to the atomic centres.

A further issue is in the selection of the atomic centre to which the overlap multipole moments are shifted; since the overlap centres are distributed between two atomic centres, the choice is somewhat arbitrary. Perhaps the simplest solution is to shift the overlap multipole moments to the closest atomic centre, thus minimising the shifting distance and limiting the problems discussed in the preceding paragraph[102]. A second solution is to require that each atomic centre receives exactly half of the overlap multipole moments, referred to as the Cumulative Atomic Multipole Moment (CAMM) method. Another popular methodology has proposed the shifting of overlap multipole moments to internuclear centroids[103].
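The nearest-centre allocation can be sketched as follows. This is an illustration under stated assumptions rather than the DMA implementation itself: the exponent pairs are hypothetical, the H–F separation of 1.733 bohr is an approximate value, and only the location of the overlap centre and the shift distance s are computed, not the shifted moments.

```python
import numpy as np

def overlap_centre(alpha, R_A, beta, R_B):
    """Overlap centre P of two Gaussians with exponents alpha and beta."""
    R_A = np.asarray(R_A, dtype=float)
    R_B = np.asarray(R_B, dtype=float)
    return (alpha * R_A + beta * R_B) / (alpha + beta)

def allocate_to_nearest(P, centres):
    """Index of the closest atomic centre and the shift distance s."""
    distances = np.linalg.norm(np.asarray(centres, dtype=float) - P, axis=1)
    nearest = int(np.argmin(distances))
    return nearest, float(distances[nearest])

# Approximate hydrogen fluoride geometry (bohr); exponents are hypothetical.
R_H = [0.0, 0.0, 0.0]
R_F = [0.0, 0.0, 1.733]
exponent_pairs = [(10.0, 0.1), (0.5, 0.8), (0.2, 5.0)]  # (alpha on H, beta on F)

for alpha, beta in exponent_pairs:
    P = overlap_centre(alpha, R_H, beta, R_F)
    atom, s = allocate_to_nearest(P, [R_H, R_F])
    print(f"alpha={alpha}, beta={beta}: P_z={P[2]:.4f}, nearest={'HF'[atom]}, s={s:.4f}")
```

Tight Gaussians on one atom pull the overlap centre towards that atom, so s is small and a low-order truncation of the shifted expansion is more likely to suffice.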

It does not require much consideration to see that the DMA scheme is highly dependent upon the basis set used. Changing basis functions requires that the overlap centres be redefined. Increasing the accuracy of the basis set introduces a greater number of overlap multipole moments of higher angular momentum. As such, the resultant atomic multipole moments fluctuate wildly with respect to a change in basis set. Additionally, DMA struggles with diffuse basis functions; the exponents of these basis functions are very small, and so the overlap centre migrates further from the atomic centre. The instability of the atomic multipole moments has been identified as arising from the partitioning of the density in the Hilbert space of basis functions. Stone has remedied this shortcoming[104] by the introduction of a numerical quadrature scheme, which essentially transfers the partitioning from Hilbert space into real space within which the electron density is defined.

2.3.2 Implementation

Transferability

The idea of an atom type is inextricably linked with that of transferability. Whilst complex definitions of an atom type have been proposed, this area remains a source of debate and competing methods[105]. It is, however, widely regarded as a necessary measure to define electrostatic properties as pre-defined parameters for large scale molecular simulation to be truly viable.

The generation of a transferable set of multipole moments is a far more delicate operation than trying to find a corresponding set of partial charges. Whilst a monopole moment is relatively transferable, higher order multipole moments are less so due to their increasing directional dependencies. The latter make it more difficult to obtain a generic set of higher order multipole moments for a given atom type.

Many molecular and group properties are tractable when attempting to demonstrate transferability. One finds that experimental heats of formation, for example, may be reproduced for a generic hydrocarbon CH₃(CH₂)ₓCH₃, by fitting to a linear relationship, $\Delta H_f = 2A + xB$. Here, A and B represent the respective energies of methyl and methylene groups. Indeed, this function is equally applicable to SCF single-point energies for equivalent systems, such that E = 2E(CH₃) + xE(CH₂) is accurate to approximately 0.06 kcal mol⁻¹[106]. Based on this additivity of single point energies, the concept is easily extended to imply the additivity of electron correlation energies. Because the correlation energy is a functional of a group’s electron density, it implies that electronic properties must additionally follow this transferability scheme[107].
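The linear group-additivity relationship amounts to a two-parameter least-squares fit. The following sketch uses synthetic values generated from assumed group contributions plus a small perturbation; none of the numbers are experimental data, and the function name is hypothetical.

```python
import numpy as np

def fit_group_contributions(x, delta_h):
    """Least-squares estimate of A (methyl) and B (methylene) in dH = 2A + xB."""
    x = np.asarray(x, dtype=float)
    design = np.column_stack([2.0 * np.ones_like(x), x])
    (A, B), *_ = np.linalg.lstsq(design, np.asarray(delta_h, dtype=float), rcond=None)
    return float(A), float(B)

# Synthetic data: assumed contributions (arbitrary energy units) and small noise.
A_true, B_true = -10.0, -5.0
x = np.arange(7)  # number of methylene groups in CH3(CH2)xCH3
noise = np.array([0.02, -0.01, 0.0, 0.01, -0.02, 0.01, -0.01])
delta_h = 2.0 * A_true + x * B_true + noise

A_fit, B_fit = fit_group_contributions(x, delta_h)
residuals = delta_h - (2.0 * A_fit + x * B_fit)
```

The size of the residuals gives a direct measure of how well the additivity assumption holds for the series under consideration.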

Armed with this, the demonstration that multipole moments possess[108] some amenability to atom typing should follow. In one case study[109], a set of small molecules composed of the functional groups present in proteins underwent DMA at the HF/3-21G level of theory, and the multipole moments of each atom were assessed. Atom typing by atomic number or hybridisation state was seen to be ineffective, but atom typing by bonding to specific functional groups proved to be more successful. Two transferable schemes were developed: atom and peptide. The former utilised the average multipole moments for specific atom types generated from the data set mentioned previously. The peptide model features a single multipole moment expansion centre for each distinct amino acid. As such, the local environment for each of these expansion centres is conserved for a given amino acid. The atom model resulted in substantial deviations from the ab initio electrostatic potential while the peptide model gave far more satisfactory results. Extending from this, a grossly distorted cyclic undecapeptide (a derivative of the immunosuppressive cyclosporine) was analysed by the above two models. The authors compared the electrostatic potentials generated by these models with one generated from DMA. Again, the peptide model exhibited lower average errors than the atom model.

Many years later, Mooij et al.[110] focused on the generation of an intermolecular potential function implemented in dimers and trimers of methanol. Using a fitted electrostatic term in the intermolecular potential resulted in relatively favourable results: 0.2 kcal mol⁻¹ and 1.6 kcal mol⁻¹ deviations in the dimer and trimer energies, respectively, from counterpoise-corrected MP2/6-311+G(2d,2p) calculations. This is still more impressive than similar studies on other less complex systems that have attempted to parameterise point-charge electrostatics[111]. Mooij et al.[110] also worked on methanol-water and methane-dimethylether complexes. Each of these systems was assigned a set of atom-centred multipole moments. Equally impressive results were obtained, with all interaction energies replicated to within 0.2 kcal mol⁻¹ of the corresponding ab initio calculations. As such, it was concluded that atom-centred multipole moment expansions are indeed transferable between the same molecules in differing environments.

There are several ways of allocating molecular electronic charge to atoms (e.g. DMA[102] or QCT partitioning[112]). Considering our group’s research interests, we focus here on QCT-based techniques. Working with the molecular energy, Bader and Beddall[82] demonstrated that:

1. The total energy of a molecule is given by a sum over the constituent atomic energies.

2. If the distribution of charge for an atom is identical in two different systems, then the atom will contribute identical amounts to the total energy in both systems.

Although these conclusions are given in terms of energy, they hold for any property density of an electronic distribution over an atomic basin. In light of this fact, it was shown by Laidig[113] that multipole moments, under certain constraints, adhere to the above conclusions, and so exhibit transferability.

The property of transferable multipole moments was successfully adopted by Breneman and co-workers[114], in a method denoted Transferable Atom Equivalents (TAEs)[115]. Primarily, a library was generated consisting of atom-based electron density fragments generated from a QCT decomposition of a set of molecules. DMA was subsequently performed on each of these fragments. These TAEs may then be geometrically transformed into a novel system for which the electrostatic potential is required. The fact that atomic property densities are additive in QCT implies that this recombination of TAEs is sufficient to reproduce an electrostatic potential of the system to a quasi-ab initio level of theory. It should, however, be noted that the transferability of atomic basins is approximate, and so this method will necessarily carry a small error.

The efficacy of this methodology was subsequently demonstrated through three “peptide-capped” molecules: alanine, diglycine and triglycine. The analytical electrostatic potentials were computed on 0.002 au isodensity surfaces. Equivalent electrostatic potentials were also generated from TAE-reconstructed systems and Gasteiger point-charges for the extended (open) and α-conformations. The TAE multipole analysis (TAE-MA) reproduced the features of the electrostatic potential generated at ab initio level much better than the Gasteiger point-charges did. A point-charge electrostatic model is unable to accurately predict extremes in electronic features.

In later work carried out in this group, all 20 naturally occurring amino acids and their constituent molecular fragments were rigorously assessed using QCT[116]. A set of 760 distinct topological atoms was generated and cluster analysis identified a set of 42 atom types in total (21 for carbon, 7 for hydrogen, 6 for oxygen, 2 for nitrogen and 6 for sulphur). The trivially assigned atom types implemented in AMBER were either too fine-grained (e.g. too many atom types for N) or too coarse-grained (e.g. carbon atom types not diverse enough). Later, an extensive investigation[117] was carried out for atom typing by atomic electrostatic potential rather than atomic multipole moments as in the previous study. A retinal-lysine system was considered, a prominent feature in the mechanism of bacteriorhodopsin. This study focused on the aldehyde and terminal amino groups of retinal and lysine, respectively. The electrostatic potentials generated by these groups occurring in the full system were compared with those of smaller derivatives of the system. The electrostatic potential of lysine surrounding the terminal amino group was relatively conserved for all derivatives in which two (methylenic) carbon atoms were maintained along the amino acid sidechain. However, the aldehyde group of retinal was more responsive to more distant environmental effects.

This work has recently been further developed, where the concept of a “horizon sphere” is proposed[118]. This sphere contains all the atoms that a given atom, at the sphere’s centre, “sees” in terms of their polarisation of the electron density on the central atom. An α-helical segment of the protein crambin was studied. The electrostatic energy was probed by consideration of the multipole moment expansion (up to rank L = 4) centred at a Cα. A new set of multipole moments for Cα was calculated for each structure dictated by the growing horizon sphere. The interaction energy between Cα and a set of probe atoms was evaluated, leading to the conclusion that formal convergence of this interaction energy is attained at a horizon sphere radius of roughly 12 Å. More work is underway to scrutinise the validity and generality of this conclusion.
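The geometric part of the horizon sphere idea, namely selecting which atoms fall within a given radius of the central atom, can be sketched as below. The coordinates and function name are hypothetical; recomputing the multipole moments of the central atom for each enclosed fragment, as done in the study, requires an electronic structure calculation and is not attempted here.

```python
import numpy as np

def atoms_within_horizon(coords, centre_index, radius):
    """Indices of all atoms within `radius` of the central atom (excluding it)."""
    coords = np.asarray(coords, dtype=float)
    distances = np.linalg.norm(coords - coords[centre_index], axis=1)
    return [int(i) for i in np.flatnonzero(distances <= radius) if i != centre_index]

# Hypothetical coordinates in angstroms; atom 0 plays the role of the central atom.
coords = [
    [0.0, 0.0, 0.0],
    [1.5, 0.0, 0.0],
    [0.0, 6.0, 0.0],
    [9.0, 9.0, 0.0],
    [0.0, 0.0, 20.0],
]

# Grow the horizon sphere and record which atoms it encloses at each radius.
for radius in (2.0, 8.0, 12.0, 16.0):
    print(radius, atoms_within_horizon(coords, 0, radius))
```

Tracking a property of the central atom against the growing list of enclosed atoms is one simple way to judge at which radius that property has converged.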

Crystallographers who strive for the generation of transferable atomic electron densities[119] have qualms with the reconstruction of molecular electron densities from these QCT-derived atomic densities. This is due to the mismatch in interatomic surface topologies between transferred atoms. As such, they believe that it becomes very difficult to generate a continuous electron density from these atomic fragments. Work has therefore been directed towards the generation of pseudoatom databanks that may be utilised to reconstruct experimental electron densities from previously elucidated structures. From this approach follows a natural output in the form of atomic multipole moments. It is important to point out that the aforementioned mismatch can be countered by accepting that the interaction energy between atoms is what ultimately matters, not the perfect construction of gapless sequences of topological atoms. With this premise in mind we have shown that the machine learning method kriging captures[120, 121], within reasonable energy error bars, the way a QCT atom changes its shape in response to a change in the positions of the surrounding atoms.

Jelsch et al.[122], for example, demonstrated the capacity of transferring experimental density parameters for small peptides, based upon the Hansen-Coppens formalism[123], and subsequently built a databank of pseudoatoms. The refinement of high resolution X-ray crystallographic data by referral to this databank has been demonstrated[124]. A more computationally-orientated route has been developed in parallel to the one above[119], whereby the experimental density parameters for a set of pseudoatoms were derived from ab initio electron densities of tripeptides. This method showed an enhanced amenability to transferability compared to its experimentally-derived counterpart.

More recently this pseudoatom database has been built upon[125]. Atom types were defined by grouping atoms with the same connectivity and bonding partners, while the atom type properties were defined by averaging over all constituent “training set” pseudoatoms. Single point calculations were initially carried out on a test set of amino acid derivatives at B3LYP/6-31G** level. The geometry of each species was taken directly from the Cambridge Structural Database (CSD)[126]. Multipole moment expansions were truncated at rank L = 4 for non-hydrogen atoms and L = 2 for the hydrogens. These multipole moments were subsequently averaged and standard deviations defined for the dataset. In terms of performance, the databank model appears to give a slightly more pronounced electrostatic potential surrounding oxygen atoms in carboxylate and hydroxyl groups of serine, leucine and glutamine, compared to the more extensive ab initio calculations. Further work showed that the databank does relatively well in the prediction of most atomic multipole moments. Exceptions take the form of higher order multipole moments, most particularly for oxygens and nitrogens. Finally, we mention that somewhat poorer results are obtained when considering total intermolecular electrostatic energies in dimers. The errors are of the same magnitude as those obtained from AMBER99, CHARMM27 and MM3. The authors ascribe these results to the implementation of a Buckingham-type approximation, whereby non-overlapping electron densities are assumed. This results in the underestimation of short-range interactions. The authors report much-reduced discrepancies in these energies by use of their own refined method, which accounts for this discrepancy[127].

However, in spite of the issues with the reconstruction of structures by use of QCT, the technique remains amenable to the elucidation of electrostatic properties. Woińska and Dominiak[128] have given a thorough elaboration on the transferability of atomic multipole moments based on various density partitions, most notably directly comparing the Hansen-Coppens formalism to both QCT and Hirshfeld partitioning. In their study, multipole moments (up to L = 4) were assigned to each atom in a set of biomolecular constructs, ranging from single amino acids to tripeptides. Atom types were subsequently defined from this molecule set based on criteria resembling those used by a similar study[129]. By averaging the multipole moments for given atom types in differing chemical environments, a standard deviation from this average value was obtained. A lower standard deviation is indicative of a high degree of transferability, and vice versa. Firstly, a QCT analysis of ab initio wavefunctions results in highly non-transferable lower-order multipole moments (L = 0, 1, 2). Secondly, the higher-order Hansen-Coppens multipole moments (L = 3, 4) are particularly unstable. The atom types found to be non-transferable from the QCT analysis are generally carbons connected to two electronegative atoms (oxygen or nitrogen), or members of aromatic systems. The decline in the level of transferability for higher-order multipole moments for the Hansen-Coppens method is relatively widespread throughout atom types, most prominently carbon and nitrogen. It is, however, strange to note that for both of these points, the poor level of transferability for QCT and Hansen-Coppens pseudoatoms is constrained to specific atom types; for the rest, these techniques generally give rise to the most transferable multipole moments.
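The transferability measure described here, a mean and standard deviation of a multipole moment per atom type, can be sketched as follows. The atom type labels and moment values are hypothetical placeholders rather than data from the cited study, and the sample standard deviation is an assumed choice.

```python
import numpy as np
from collections import defaultdict

def transferability(records):
    """Mean and sample standard deviation of a moment, grouped by atom type.

    `records` is an iterable of (atom_type, moment_value) pairs; each atom
    type needs at least two entries for the standard deviation to be defined.
    """
    grouped = defaultdict(list)
    for atom_type, value in records:
        grouped[atom_type].append(value)
    return {
        atom_type: (float(np.mean(values)), float(np.std(values, ddof=1)))
        for atom_type, values in grouped.items()
    }

# Hypothetical monopole moments for two atom types in different environments.
records = [
    ("C_alpha", 0.50), ("C_alpha", 0.52), ("C_alpha", 0.48),
    ("O_carbonyl", -1.10), ("O_carbonyl", -1.30),
]
stats = transferability(records)
```

A smaller standard deviation for an atom type indicates that the moment changes little between chemical environments, which is the sense of transferability used in the text.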

Whilst the lower-order QCT multipole moments are largely non-transferable, they tend to be far more stable when derived from crystallographic data. In spite of this, they are still the least transferable multipole moments in the set, with both Hirshfeld partitioning and the Hansen-Coppens pseudoatom formalism yielding somewhat more stable multipole moments. The authors conclude that the most transferable multipole moments result from Hirshfeld partitioning. QCT discretely partitions electron density into distinct basins and so is vulnerable to numerical issues when undertaking mathematical operations such as integration over the basin. The Hansen-Coppens formalism, on the other hand, suffers from problems with localisation: distant electron density may be incorrectly assigned to a given nucleus. However, an exhaustive study of standard deviations from average multipole moments for given partitioning methods does little to confirm the dominance of one scheme over another. Transferability matters little if the multipole moments in question are incorrectly defined; their subsequent variances over a dataset are inconsequential. In fact, the difficulty in assigning transferable multipole moments to given atom types may equally be indicative of poor atom type definition, or the sheer inability to define a transferable atom type in terms of multipole moments with any great stability. We make a final note in that the atom types defined in this work have been tailored for pseudoatom usage[130, 131], and so may not be useable with a discrete partitioning scheme (QCT), relative to the so-called ‘fuzzy’ decompositions (Hirshfeld and Hansen-Coppens). In fact, this is concomitant with an analysis of dimer energies obtained from these three techniques[132]. When one uses multipole moments obtained by a QCT decomposition as opposed to a Hirshfeld partitioning, the electrostatic energy obtained more closely resembles that obtained from a Morokuma-Ziegler energy decomposition scheme, by as little as 10%.

Simulation

A recent tour de force regarding the feasibility of biomolecular simulation has seen the computation of time trajectories of systems such as the ribosome[133]. However, this study implemented techniques not optimised for the output of particularly accurate results, using a highly parallelisable Charm++ interface in conjunction with the AMBER force field and the NAMD molecular dynamics package[134], i.e. a partial charge approximation.

Biomolecular simulation requires the implementation of periodic boundary conditions to accurately model the environment in which a system resides. Moreover, the electrostatic energy of a system is slowly convergent. Many solutions to this problem have been proposed over the years[135, 48]. It should be noted that the interaction involving ‘higher order’ multipole moments (L ≥ 1) is more short-range than that between monopole moments. As such, the problem of slowly convergent long-range interactions is shared by both isotropic point-charges and multipolar electrostatics because the latter encompass point-charges.

We wish to raise the issue of the conformational dependence of electronic properties. In reality, this problem is not unique to higher order multipole moments. Conventional force fields, which employ partial charges, choose to reside in a pseudo-reality of an invariable electrostatic representation, and so rarely encounter conformational dependencies. It is rather more difficult to simply ignore the obvious reality of molecules as flexible entities when using multipole moments, as has been emphasised in analyses using both DMA and CAMM algorithms[136, 137]. Use of the electrostatic properties for one conformation correctly reproduces its corresponding electrostatic potential. However, use of this parameterisation in an alternative conformation results in highly unfavourable energies. It should be noted that this insufficiency is equally prominent when using a conserved set of partial charges between conformers. As such, since the issue of flexible molecules is a computational complexity that pervades all electrostatic approximations, it would be unfair to regard this as an additional burden specifically for multipole moments. Instead, it is a hurdle that both techniques are required to overcome in enhancing the accuracy of simulation.

Evaluating multipole moments as a function of a conformational parameter is appealing. For example, the multipole moments for both atoms in carbon monoxide may be described analytically as a function of the interatomic distance in the molecule[138]. However, scaling this idea up to systems with far more conformational degrees of freedom, such as an amino acid, is an appreciably more difficult task.

An “analytical compromise” has been proposed in the past[30], whereby the multipole moments of an atom in ethanol, glycine and acrolein are represented by a Fourier series truncated at third order, whose free variables correspond to the dihedral angles of the molecule. Instead of analytical methods, machine learning methods can be used to interpolate between a set of multipole moments defined for different molecular conformations[121]. As such, one may then predict the multipole moments of an arbitrary conformation that is not present within the initial training set, which corresponds to a true external validation.
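A third-order Fourier representation of a multipole moment can be sketched as a linear least-squares problem. The example below is a simplification under stated assumptions: it uses a single dihedral angle rather than several, the coefficients generating the training values are hypothetical, and the function names are illustrative only.

```python
import numpy as np

def fourier_design(phi, order=3):
    """Design matrix with columns 1, cos(n phi), sin(n phi) for n = 1..order."""
    phi = np.atleast_1d(np.asarray(phi, dtype=float))
    columns = [np.ones_like(phi)]
    for n in range(1, order + 1):
        columns += [np.cos(n * phi), np.sin(n * phi)]
    return np.column_stack(columns)

def fit_fourier(phi, values, order=3):
    """Least-squares Fourier coefficients for a moment sampled at dihedrals phi."""
    coeffs, *_ = np.linalg.lstsq(fourier_design(phi, order),
                                 np.asarray(values, dtype=float), rcond=None)
    return coeffs

def predict(coeffs, phi):
    """Evaluate the fitted series at new dihedral angles."""
    order = (len(coeffs) - 1) // 2
    return fourier_design(phi, order) @ coeffs

# Synthetic training set generated from hypothetical coefficients.
true_coeffs = np.array([0.10, 0.05, -0.02, 0.01, 0.0, -0.005, 0.003])
phi_train = np.linspace(-np.pi, np.pi, 24, endpoint=False)
q_train = fourier_design(phi_train) @ true_coeffs

coeffs = fit_fourier(phi_train, q_train)
q_new = predict(coeffs, [0.3])  # a dihedral absent from the training set
```

Evaluating the fit at a dihedral that was not used for training mirrors, in a very reduced form, the external validation described above.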

Alternative methods have been developed but the literature on these techniques appears to be relatively sparse. For example, it has been proposed[139] that one may average the atomic multipole moments over all conformers that are sampled during a simulation. This has been done for alanine and glycine by shifting the higher order (L ≥ 1) atomic multipole moment expansions to a smaller number of expansion sites distributed throughout the molecule. An additional method, previously tested for energy minimisations of crystal structures, revolves around periodically recalculating the atomic multipole moments for the molecule[140]. This proved to give substantially better results than the implementation of then-current methodologies, particularly for systems whose structures are dictated by strong hydrogen bonding.

Forces (and torques) must be calculated for the molecular translational and rotational degrees of freedom to be sampled during the course of a simulation. These may be formulated directly by first and second derivatives of interaction energies, by use of translational and rotational differential operators. If one considers a molecule as a rigid body, the individual atomic multipole moments of the molecule are invariant relative to their stationary local axis systems. As such, the derivative of the interaction energy between two molecular species is given by the derivative of the interaction tensor only. This has been demonstrated in the spherical tensor formalism[141, 142] and its application (see e.g. [143]). A simulation package that allows for this rigid-body approximation in conjunction with multipolar electrostatics exists[144] and is called DL_MULTI.

The invariance of atomic multipole moments with respect to a local axis system no longer holds when abandoning the rigid-body approximation in favour of a realistic flexible-body protocol. As such, differentiation of the interaction energy function requires the derivatives of multipole moments in addition to the interaction tensor. Whilst this requires[145] a more involved series of calculations, it is still an attainable requirement. Somewhat more problematic is the fact that the local axis systems in which the atomic multipole moments are referenced evolve over the course of a simulation. Due to the flexible nature of the molecule, neighbouring atomic positions that make up a local axis system change with respect to time. This results in the subsequent net rotation of the atomic local frames. In the context of QCT and the machine learning method kriging, analytical forces can be calculated for “flexible, multipolar atoms”, although this is not trivial[146].
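The distinction between the rigid-body and flexible-body cases can be made explicit with the product rule. The sketch below assumes that the interaction energy between atoms A and B is written as a contraction of multipole moments $Q$ with an interaction tensor $T$; the index notation is illustrative and may differ from that used in Section 2.3.1. Differentiating with respect to some nuclear coordinate $\xi$ gives, schematically:

```latex
E_{AB} = \sum_{l_A m_A} \sum_{l_B m_B}
         Q^{A}_{l_A m_A} \, T_{l_A m_A l_B m_B} \, Q^{B}_{l_B m_B}

\frac{\partial E_{AB}}{\partial \xi}
  = \underbrace{\sum_{l_A m_A} \sum_{l_B m_B}
      Q^{A}_{l_A m_A} \,
      \frac{\partial T_{l_A m_A l_B m_B}}{\partial \xi} \,
      Q^{B}_{l_B m_B}}_{\text{rigid-body term}}
  + \underbrace{\sum_{l_A m_A} \sum_{l_B m_B}
      \left[
        \frac{\partial Q^{A}_{l_A m_A}}{\partial \xi} \,
        T_{l_A m_A l_B m_B} \, Q^{B}_{l_B m_B}
        + Q^{A}_{l_A m_A} \, T_{l_A m_A l_B m_B} \,
        \frac{\partial Q^{B}_{l_B m_B}}{\partial \xi}
      \right]}_{\text{vanishes for rigid bodies}}
```

In the rigid-body approximation only the first term survives. For flexible molecules the second term is non-zero, and any rotation of the local axis systems contributes through the derivatives of the moments once they are expressed in a common frame.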

Literature quotations of CPU time differences between a molecular dynamics simulation using point-charges versus multipole moments (for a given number of nanoseconds, of a given biomolecule with a given number of water molecules surrounding it) are virtually non-existent. However, Sagui et al.[147] reported a representative ratio of 8.5 in favour of point-charge electrostatics (as implemented in AMBER 7) when most of the calculation is moved to the reciprocal space (via the PME method) with multipolar interactions up to hexadecapole-hexadecapole being included. The only way a point-charge model can ever match the accuracy of a nucleus-centred multipolar model is via the introduction of extra off-nuclear point charges. What is rarely stated is that these additional charges create an enormous computational overhead in a typical system of tens of thousands of atoms because charge-charge interactions are longer range (1/r-dependence) than any interaction between multipole moments.

AMOEBA

Arguably one of the most successful next-generation force fields is AMOEBA (Atomic Multipole Optimised Energetics for Biomolecular Applications)[26]. AMOEBA has been proven effective in a variety of biomolecular simulations, ranging from solvated ion systems[148, 149, 150] to organic molecules[151] and peptides[152, 153, 154]. The electrostatic energy component of the force field is broken down into two terms. The first term is concerned with permanent atomic multipole moments (expansions truncated at L = 2), whilst the second term corresponds to induced multipole moments as a result of polarisation effects. The permanent atomic multipole moments are generated by DMA of a set of small molecules such that atom types may be defined. When implemented during a simulation, these atomic multipole moments may be rotated into various fixed local axis systems within the molecule. AMOEBA proves to be competitive, even with ab initio level calculations. All levels of theory tested perform in a uniform manner: the average ∆E values across all conformations at the MP2/TQ, ωB97/LP, B3LYP/Q and AMOEBA levels of theory are 3.73, 3.15, 3.64 and 3.30 kcal mol$^{-1}$, respectively. These are impressive values, but one must remain aware that they correspond to total energies as opposed to those arising specifically from the electrostatic component of the force field.

A true demonstration of the benefits corresponding to multipole moments arises from a direct comparison between AMOEBA and the various force fields that employ point-charges. Kaminsky and Jensen[155], for example, sampled the number of energetic minima of glycine, alanine, serine and cysteine one recovers at MP2 level. The number of minima and their relative energies were subsequently compared to those recovered by use of AMOEBA and four other point-charge force fields. The results for serine and cysteine are outlined in Table 2.1, where the ab initio data suggests 39 and 47 minima, respectively. We see that AMOEBA consistently outperforms the large majority of point-charge force fields in terms of the mean absolute deviation (MAD) of energies relative to the MP2 results. AMOEBA additionally outperforms all other force fields in terms of the number of minima it predicts for each amino acid. Note that the latter result gives rise to an artificially large MAD value relative to the other force fields. Considering the aforementioned more favourable MAD corresponding to AMOEBA, this only emphasises the predictive capacity of AMOEBA.

Serine

| | MM3 | OPLS 2005 | AMBER99 | CHARMM27 | AMOEBA |
|---|---|---|---|---|---|
| Number of Correct Minima | 15 | 23 | 21 | 19 | 34 |
| Number of Erroneous Minima | 3 | 3 | 7 | 2 | 7 |
| Percentage Erroneous | 20.0 | 13.0 | 33.3 | 10.5 | 20.6 |
| MAD (kJ mol$^{-1}$) | 14.0 | 4.1 | 10.9 | 11.1 | 4.2 |

Cysteine

| | MM3 | OPLS 2005 | AMBER99 | CHARMM27 | AMOEBA |
|---|---|---|---|---|---|
| Number of Correct Minima | 21 | 25 | 29 | 21 | 44 |
| Number of Erroneous Minima | 3 | 3 | 5 | 5 | 1 |
| Percentage Erroneous | 14.3 | 12.0 | 17.2 | 23.8 | 2.3 |
| MAD (kJ mol$^{-1}$) | 13.9 | 4.6 | 5.4 | 6.6 | 3.1 |

Table 2.1: The number of geometric minima predicted by a variety of force fields for serine and cysteine. Also given are the number of minima that the various force fields predicted but that were not represented in the set of minima generated by ab initio calculations, and the mean absolute deviations (MAD) for the molecular energies at each geometry.
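Table 2.1 does not state how the “Percentage Erroneous” row is defined. The quoted values appear to be consistent with the number of erroneous minima expressed as a percentage of the number of correct minima (rather than of the total number of minima predicted). The short check below recomputes the row under that assumed definition; the dictionary layout and the function name `percentage_erroneous` are illustrative only.

```python
# Values transcribed from Table 2.1:
# force field -> (correct minima, erroneous minima, quoted percentage erroneous)
table_2_1 = {
    "serine": {
        "MM3": (15, 3, 20.0),
        "OPLS 2005": (23, 3, 13.0),
        "AMBER99": (21, 7, 33.3),
        "CHARMM27": (19, 2, 10.5),
        "AMOEBA": (34, 7, 20.6),
    },
    "cysteine": {
        "MM3": (21, 3, 14.3),
        "OPLS 2005": (25, 3, 12.0),
        "AMBER99": (29, 5, 17.2),
        "CHARMM27": (21, 5, 23.8),
        "AMOEBA": (44, 1, 2.3),
    },
}

def percentage_erroneous(n_correct, n_erroneous):
    """Assumed definition: erroneous minima as a percentage of correct minima."""
    return 100.0 * n_erroneous / n_correct

# Recompute every entry of the row, rounded to one decimal place as in the table.
recomputed = {
    (molecule, force_field): round(percentage_erroneous(n_ok, n_bad), 1)
    for molecule, rows in table_2_1.items()
    for force_field, (n_ok, n_bad, _) in rows.items()
}
```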

Another study that has directly compared AMOEBA to a variety of other conventional force fields (AMBER, MM2, MM3, MMFF and OPLS) is that of Rasmussen et al.[156], where relative conformational energies were approximated. A set of minima were generated for several molecules, each with intermediary electrostatic properties ranging from entirely non-polar to zwitterionic. The ability of the various force fields to predict relative energies of the minima was probed, in addition to three separate AMOEBA parameterisation schemes, differing in atom typing or level of theory. All force fields performed extremely well for the non-polar molecules, largely due to the minor electrostatic contribution to the conformational energy of non-polar molecules. As such, the level at which electrostatics were calculated is essentially irrelevant. However, as the molecular species become more polar in nature, the point-charge force fields begin to display their erroneous nature relative to the AMOEBA parameterisations, which demonstrate a more uniform predictive capacity. The zwitterionic species were modelled well by several of the point-charge force fields. This can be explained by the fact that full charges are properly represented by a spherical electrostatic potential as the charge is highly localised. Thus, point-charge implementations of electrostatics can model such a case with relative ease.

The accurate reproduction of the properties of water has long plagued simulation. Accounting for the explicit binding of water molecules, in addition to accurately modelling bulk properties such as the dielectric constant, is necessary if one wishes to accurately simulate solvated systems. AMOEBA attempts to account for the lack of a universally acceptable water model by specifically parameterising the water molecule[157, 158]. Much like the generic AMOEBA force field, atomic multipole moment expansions up to L = 2 are generated using DMA. Polarisation is accounted for via induced atomic dipoles, and van der Waals interactions are modelled by a buffered 14-7 LJ potential. To the credit of the developers, this model is continually improved upon and reparameterised. Most recently[159], atomic multipole moments were generated for a water model at MP2 level with various basis sets in order to probe the reproduction of hydration free energies for a set of small molecules. Whilst the aug-cc-pVTZ basis set was found to give the best results, 6-311++G(2d,2p) gave a comparable accuracy at a much lower computational cost, and so is recommended for larger simulations.

A direct comparison between an AMOEBA water model parameterised at MP2/6-311++G(2d,2p) level and a widely used point-charge water model reveals the benefits of atomic multipole moment-parameterised electrostatics. A popular choice for explicit solvation is the TIP3P model, which assigns a single point-charge to each atomic centre and implements a 12-6 LJ function. Since the LJ functions differ between the two models, a “TIP3P-like” model was generated, which used the AMOEBA water model, but removed all multipole moments (static and induced), replacing them with point-charges. TIP3P and TIP3P-like models were shown to be equivalent by comparison of RDFs and bulk simulation properties. Deviations of the computed hydration free energies from experimental benchmarks for a set of small molecules are given in Table 2.2 for both AMOEBA and TIP3P-like models. AMOEBA outperforms the TIP3P-like model quite spectacularly, where the RMSD of the AMOEBA values is roughly three times smaller than the RMSD of the TIP3P-like values.

| | Ethylbenzene | p-cresol | Isopropanol | Imidazole | Acetic Acid | RMSD |
|---|---|---|---|---|---|---|
| AMOEBA | -0.73 | -7.27 | -5.58 | -10.11 | -5.69 | 0.68 |
| TIP3P-like | -0.89 | -10.72 | -5.29 | -10.87 | -7.46 | 1.96 |
| Experiment | -0.70 | -6.10 | -4.70 | -9.60 | -6.70 | - |

Table 2.2: Free energies of hydration for a variety of molecules predicted by use of AMOEBA and TIP3P-like water potentials. Corresponding experimental values are also given. Energies are in kcal mol$^{-1}$.

2.4 Machine Learning

2.4.1 Overview

We have explored the means one may employ to obtain a number of ab initio atomic properties that can be used for force field construction. An area that is entirely overlooked by the classical force field community is that these quantities are not static with respect to the internal degrees of freedom of a system.¹⁰ The internal degrees of freedom of a molecular system are in a constant state of motion owing to inter- and intramolecular forces. When a molecule reconfigures itself, the associated molecular electron density, and consequently the atomic partitioning, instantaneously adjust. Since atomic properties are obtained from integrals over atomic domains, atomic properties must similarly alter with respect to a shift in molecular conformation.

¹⁰ Indeed, these quantities are not static with respect to the external degrees of freedom if placed in some anisotropic medium, but a discussion on external degrees of freedom is not undertaken here for the sake of simplicity.

The question then is how one can obtain a conformationally-dependent force field, without having to resort to obtaining analytical PESs (which limit the system size to a handful of atoms, e.g. [160]). Perhaps the simplest way to do so is to construct a database containing conformations and associated atomic/molecular properties. Of course, if one were to construct an entire biomolecular force field in this way, the memory required to store these databases for a given accuracy would be overwhelming. However, such databases have been constructed for small molecular systems, and used for geometry optimisation[161] and structure refinement[162] with some degree of success.

An alternative method would be to use some analytical function of a number of molecular degrees of freedom, along which an atomic/molecular property evolves. Fluctuating multipole models ascribe a multipole moment, or set of multipole moments, to an atom, and allow these to vary in response to some external field[139]. This external field is an average description of perturbations resulting from the atomic environment, and so is computationally efficient. The fluctuating charge model[163] is one such popular model, where each atom is assigned a fluctuating monopole moment. Further work has been done to ascribe higher order fluctuating multipole moments to each atom[164, 165], but this is not a widely-used methodology. The issue with these models is the idealised notion of a molecular perturbation being well-modelled by a single external field vector. In addition, the fluctuating multipole moments are typically constrained to a harmonic evolution, whereas in reality multipole moments possess a far more complex conformational dependence[166].

An attractive solution is some form of interpolative scheme, whereby atomic properties for given molecular conformations are explicitly computed. Such points are referred to as “training points”. Collectively, the set of training points is referred to as a “training set”. An interpolative scheme can then be employed to predict atomic properties at “untrained” points on the molecular potential energy surface.

Machine learning is a field that is suited to such a task. It is a broad field of mathematics, dealing with the construction of algorithms that can “learn” mappings from a given input to a response by use of a training set. Machine learning has gained some popularity in being used to predict atomic/molecular energies for given conformations. We review the more popular implementations, but are necessarily brief, and omit a number of methodologies that would be deserving of attention were space not a constraining factor (see, for example, [167]).

Perhaps the most popular machine learning techniques are the Artificial Neural Networks (ANNs), a set of massively parallel interconnected nodes that relate some input vector to a response function[168], the inspiration of which is taken from the functioning of biological neuronal circuits. A number of ANNs have been developed over the years[169, 170]. At the time of writing, only feed-forward neural networks, also referred to as multilayer perceptrons, have been used in the construction of a force field[9]. ANNs have been applied to a wide range of problems, such as the calculation of energies of systems in the solid state[171, 172], surface adsorption, water[173] and interatomic potentials[174].

Past work within our group[175, 176] has employed ANNs for the prediction of atomic multipole moments. Some success was attained, but ANNs notoriously deteriorate in quality in high-dimensional spaces[177]. This is problematic since the number of conformational degrees of freedom required for the accurate modelling of conformational dependence exceeds the dimensionality for which ANNs are viable. In addition, ANNs suffer from the phenomenon of “overtraining”, where the number of free parameters used in the construction of the ANN makes its predictive capacity deteriorate[178].

Gaussian Process Regression[179] is a second popular machine learning technique. In this framework, it is assumed that the best estimate for an atomic property in a given molecular conformation can be expanded in Gaussian basis functions[180]. Gaussian Approximation Potentials[16, 181] have been used to this end. Complex descriptors, such as symmetry functions[182, 183], are employed in the construction of a conformational basis set that is invariant with respect to global translations, rotations, permutations of atoms, and a number of other occurrences, which render these potentials cumbersome[184].

In the remainder of this section, we describe an alternative Gaussian Process Regression (GPR) machine learning technique, referred to as kriging. Kriging finds its origins in the field of geostatistics[185], where mineral concentrations were predicted at arbitrary sites. Kriging has gone on to be used in a number of other disciplines, such as real estate appraisal[186], the design of integrated circuits[187] and environmental science[188]. We have found kriging to be an ideal choice for the construction of a conformationally dependent force field since it suffers from neither overtraining nor deterioration in high-dimensional spaces, both of which plague the implementation of ANNs.

2.4.2 Kriging

In the following, we adopt the convention of enumerating elements of a set with a superscript, and components of a vector or matrix with a subscript. We define a set of $n$ training points, $X := \{\mathbf{x}^1, \dots, \mathbf{x}^n\}$, where each training point $\mathbf{x}$ is a $d$-dimensional vector, $[x_1 \dots x_d]^\top$. For each training point, we can evaluate a response function, $y = f(\mathbf{x})$, and collect these responses into an $n$-dimensional vector, $\mathbf{y} = [y^1 \dots y^n]^\top$, for convenience. Kriging is a machine learning method that allows one to predict the value of the response at untrained points, $f_{\mathrm{pred}}(\mathbf{x}')$, where $\mathbf{x}' \notin X$, as a function of the elements of $X$.

Kriging is strictly interpolative, which means that the prediction of the response function at trained points, i.e. $f_{\mathrm{pred}}(\mathbf{x})$, $\mathbf{x} \in X$, is exact. At untrained points, it can be shown that kriging is the best linear unbiased predictor with respect to the minimisation of the sum of squared residuals[189]. The term unbiased means that the expectation value of predictions, $\langle f_{\mathrm{pred}}(\mathbf{x}') \rangle_{\mathbf{x}' \notin X}$, is equal to the expectation value of the response function, i.e. the first moment, or mean, of the distribution $f(\mathbf{x})$. This definition of an unbiased predictor is used as a constraint placed on the kriging estimator, and features as a Lagrange multiplier in the general derivation for the kriging estimator[189].

The key observation of kriging is that the responses (i.e. the response function evaluations) at two distinct points, $f(\mathbf{x}^i)$ and $f(\mathbf{x}^j)$, are similar if $\mathbf{x}^i$ and $\mathbf{x}^j$ are close together in space, provided the response function is sufficiently well-conditioned. We can summarise the correlation between training points in an $n \times n$ symmetric, positive semidefinite matrix, $\mathbf{R}$, whose $ij$th element corresponds to the correlation between points $\mathbf{x}^i$ and $\mathbf{x}^j$. The correlation function, or kernel, we use for this work is a power correlation function[179], of the form

$$ R_{ij} = \exp\left[ -\sum_{h=1}^{d} \theta_h \left| x_h^i - x_h^j \right|^{p_h} \right], \tag{2.4.1} $$

where $h$ is an index running over the $d$ degrees of freedom of points $\mathbf{x}^i$ and $\mathbf{x}^j$, and $\theta_h$ and $p_h$ are the so-called hyperparameters associated with these degrees of freedom, and can be accumulated into the $d$-dimensional vectors $\boldsymbol{\theta}$ and $\mathbf{p}$, respectively.

Rasmussen and Williams[179] have defined the $\theta_h$ hyperparameter as being equal to $1/2r_h$, $r_h$ being the “characteristic distance” of that particular feature. In this way, $r_h$ is a measure of the extent of the correlation among different points “along” that feature. Rasmussen and Williams showed the hyperparameters to be bounded as follows

$$ \begin{aligned} &0 \leq \theta_h && \forall h = 1, \dots, d \\ &0 \leq p_h \leq 2 && \forall h = 1, \dots, d \end{aligned} \tag{2.4.2} $$

Analysis of (2.4.1) shows that, in the case where $\mathbf{x}^i$ and $\mathbf{x}^j$ coincide, $R_{ij} = 1$, whilst in the limit where the two points are infinitely removed from one another, $R_{ij} \to 0$.

In addition, the hyperparameters $\{\boldsymbol{\theta}, \mathbf{p}\}$ possess physical significance. When $\theta_h$ is small, the influence of the $h$th degree of freedom is enhanced, leading to an increase in correlation between two points. Similarly, if $\theta_h$ is large, the influence of the $h$th degree of freedom is diminished, leading to a decrease in correlation between the two points. As such, we can conclude that $\theta_h$ is a measure for the “importance” of a degree of freedom on the correlation between points. The interpretation of the components of $\mathbf{p}$ is a little more abstract, but they are typically referred to as the “smoothness” of the process. We see that when $p_h = 2$, the correlation between points takes a Gaussian form, so that as two points become more distant, their correlation dies off as a Gaussian. If $p_h = 1$, the correlation between two points dies off as a simple exponential. Owing to the boundary conditions in (2.4.2), $p_h$ can take on any real value between 0 and 2, the physical significance of which is not entirely clear.
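The kernel of (2.4.1) and the limiting behaviour described above can be sketched in a few lines. This is a minimal illustration rather than the implementation used in this work; the function name `correlation_matrix`, the example training points and the hyperparameter values are assumptions chosen only to make the example concrete.

```python
import numpy as np

def correlation_matrix(X, theta, p):
    """Power correlation matrix of (2.4.1).

    X     : (n, d) array, one training point per row
    theta : (d,) array of non-negative hyperparameters
    p     : (d,) array of exponents, each between 0 and 2
    """
    X = np.asarray(X, dtype=float)
    # |x_h^i - x_h^j| for every pair (i, j) and every feature h: shape (n, n, d)
    diff = np.abs(X[:, None, :] - X[None, :, :])
    return np.exp(-np.sum(theta * diff ** p, axis=-1))

# Three illustrative two-dimensional training points.
X_train = np.array([[0.0, 0.0], [1.0, 0.5], [2.0, 2.0]])
theta = np.array([0.5, 0.1])
p = np.array([2.0, 2.0])   # Gaussian decay along both features

R = correlation_matrix(X_train, theta, p)
```

The diagonal of `R` is unity because every point is perfectly correlated with itself, and the off-diagonal elements shrink towards zero as the points move apart or as the components of `theta` increase.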

To obtain optimal predictions by means of kriging, we require that the log-likelihood function, $L(\boldsymbol{\theta}, \mathbf{p}, \sigma^2, \mu)$, is maximised[190]. The log-likelihood function has the form

$$ L(\boldsymbol{\theta}, \mathbf{p}, \sigma^2, \mu) = -\frac{n}{2}\ln(\sigma^2) - \frac{1}{2}\ln(|\mathbf{R}|) - \frac{(\mathbf{y} - \mathbf{1}\mu)^\top \mathbf{R}^{-1} (\mathbf{y} - \mathbf{1}\mu)}{2\sigma^2}, \tag{2.4.3} $$

where $\mathbf{1}$ is a column vector of ones, $\mu$ and $\sigma^2$ are the first and second moments, respectively, of the response function, and $|\mathbf{R}|$ is the conventional notation for a matrix determinant. We can estimate $\mu$ and $\sigma^2$ from the available data by setting the derivatives of (2.4.3) with respect to $\mu$ and $\sigma^2$ equal to zero. Analytical forms for the optimal values of $\mu$ and $\sigma^2$, which we denote $\hat{\mu}$ and $\hat{\sigma}^2$, respectively, are found to be[191]

$$ \hat{\mu} = \frac{\mathbf{1}^\top \mathbf{R}^{-1} \mathbf{y}}{\mathbf{1}^\top \mathbf{R}^{-1} \mathbf{1}}, \qquad \hat{\sigma}^2 = \frac{(\mathbf{y} - \mathbf{1}\hat{\mu})^\top \mathbf{R}^{-1} (\mathbf{y} - \mathbf{1}\hat{\mu})}{n}. \tag{2.4.4} $$

Substitution of (2.4.4) into (2.4.3) yields the concentrated log-likelihood function, $\hat{L}(\boldsymbol{\theta}, \mathbf{p})$, where the dependence on $\hat{\mu}$ and $\hat{\sigma}^2$ is implicit through their dependence on $\boldsymbol{\theta}$ and $\mathbf{p}$. Up to an additive constant, the concentrated log-likelihood function has the form

$$ \hat{L}(\boldsymbol{\theta}, \mathbf{p}) = -\frac{n}{2}\ln(\hat{\sigma}^2) - \frac{1}{2}\ln(|\mathbf{R}|). \tag{2.4.5} $$

We then see that maximisation of the concentrated log-likelihood function requires maximisation with respect to $\{\boldsymbol{\theta}, \mathbf{p}\}$ only. A number of strategies are available for this purpose, both analytical and numerical, yielding the optimal estimates $\hat{\boldsymbol{\theta}}$ and $\hat{\mathbf{p}}$. One such optimisation strategy is discussed in Section 2.4.3. We do, however, point out that any optimisation procedure requires iterative calculations of $|\mathbf{R}|$, which is not trivial computationally when $n$ is large. As such, maximisation algorithms requiring as few iterations as possible will naturally be desirable in order to circumvent the large computational costs associated with evaluating the matrix determinant.
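A minimal sketch of how (2.4.4) and (2.4.5) might be evaluated for a given pair of hyperparameter vectors is shown below. The use of a Cholesky factorisation to obtain both the linear solves and $\ln|\mathbf{R}|$ is a common numerical choice that we assume here for illustration; it is not claimed to be the approach taken in the software used in this work. The function names and the toy one-dimensional data are likewise hypothetical.

```python
import numpy as np

def correlation_matrix(X, theta, p):
    """Power correlation matrix of (2.4.1) for training points X of shape (n, d)."""
    diff = np.abs(X[:, None, :] - X[None, :, :])
    return np.exp(-np.sum(theta * diff ** p, axis=-1))

def concentrated_log_likelihood(X, y, theta, p):
    """Return (L_hat, mu_hat, sigma2_hat) following (2.4.4) and (2.4.5)."""
    n = y.size
    ones = np.ones(n)
    R = correlation_matrix(X, theta, p)
    chol = np.linalg.cholesky(R)              # R = chol @ chol.T

    def solve_R(b):
        # Solve R x = b using the two triangular factors.
        return np.linalg.solve(chol.T, np.linalg.solve(chol, b))

    mu_hat = (ones @ solve_R(y)) / (ones @ solve_R(ones))
    resid = y - mu_hat
    sigma2_hat = (resid @ solve_R(resid)) / n
    log_det_R = 2.0 * np.sum(np.log(np.diag(chol)))   # ln|R| from the factor
    log_lik = -0.5 * n * np.log(sigma2_hat) - 0.5 * log_det_R
    return log_lik, mu_hat, sigma2_hat

# Toy one-dimensional training set (illustrative only).
X_train = np.arange(5.0)[:, None]
y_train = np.sin(X_train[:, 0])
theta = np.array([1.0])
p = np.array([2.0])

log_lik, mu_hat, sigma2_hat = concentrated_log_likelihood(X_train, y_train, theta, p)
```

An optimiser such as that of Section 2.4.3 would call a function of this kind repeatedly with trial values of `theta` and `p`, which is why the cost of each factorisation matters when $n$ is large.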

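Now, in order that predictions can be made at some arbitrary untrained point, $\mathbf{x}^*$, we compute an $n$-dimensional vector, $\mathbf{r} = [R_{1*} \dots R_{n*}]^\top$, the components of which, $R_{i*}$, correspond to the correlation between $\mathbf{x}^i$ and $\mathbf{x}^*$, as given by (2.4.1). Defining an augmented $(n + 1) \times (n + 1)$ correlation matrix, $\tilde{\mathbf{R}}$, as

$$ \tilde{\mathbf{R}} = \begin{pmatrix} \mathbf{R} & \mathbf{r} \\ \mathbf{r}^\top & 1 \end{pmatrix}, \tag{2.4.6} $$

we can rewrite the log-likelihood function, (2.4.3), along with our optimal $\hat{\mu}$ and $\hat{\sigma}^2$, (2.4.4), as

$$ L(\boldsymbol{\theta}, \mathbf{p}, \hat{\sigma}^2, \hat{\mu}) = -\frac{n}{2}\ln(\hat{\sigma}^2) - \frac{1}{2}\ln(|\tilde{\mathbf{R}}|) - \frac{(\tilde{\mathbf{y}} - \mathbf{1}\hat{\mu})^\top \tilde{\mathbf{R}}^{-1} (\tilde{\mathbf{y}} - \mathbf{1}\hat{\mu})}{2\hat{\sigma}^2}, \tag{2.4.7} $$

where we have introduced the augmented response vector, $\tilde{\mathbf{y}} = [y^1 \dots y^n \; y^*]^\top$, where $y^*$ is the response at the point $\mathbf{x}^*$, i.e. the value we wish to predict. Only the final term in (2.4.7) depends upon $y^*$, and so writing this out explicitly

$$ \frac{(\tilde{\mathbf{y}} - \mathbf{1}\hat{\mu})^\top \tilde{\mathbf{R}}^{-1} (\tilde{\mathbf{y}} - \mathbf{1}\hat{\mu})}{2\hat{\sigma}^2} = \frac{1}{2\hat{\sigma}^2} \begin{pmatrix} \mathbf{y} - \mathbf{1}\hat{\mu} \\ y^* - \hat{\mu} \end{pmatrix}^\top \begin{pmatrix} \mathbf{R} & \mathbf{r} \\ \mathbf{r}^\top & 1 \end{pmatrix}^{-1} \begin{pmatrix} \mathbf{y} - \mathbf{1}\hat{\mu} \\ y^* - \hat{\mu} \end{pmatrix}. \tag{2.4.8} $$

$\tilde{\mathbf{R}}^{-1}$ can be given explicitly by use of the partitioned inverse formula from Thiele[192], which subsequently allows us to write the augmented log-likelihood function, omitting those terms of the quadratic form that are independent of $y^*$, as

$$ \tilde{L}(\boldsymbol{\theta}, \mathbf{p}, \hat{\sigma}^2, \hat{\mu}) = -\frac{n}{2}\ln(\hat{\sigma}^2) - \frac{1}{2}\ln(|\tilde{\mathbf{R}}|) - \left[ \frac{(y^* - \hat{\mu})^2}{2\hat{\sigma}^2 (1 - \mathbf{r}^\top \mathbf{R}^{-1} \mathbf{r})} \right] + \left[ \frac{\mathbf{r}^\top \mathbf{R}^{-1} (\mathbf{y} - \mathbf{1}\hat{\mu})}{\hat{\sigma}^2 (1 - \mathbf{r}^\top \mathbf{R}^{-1} \mathbf{r})} \right] (y^* - \hat{\mu}). \tag{2.4.9} $$

Finding the derivative of this quantity with respect to $y^*$ and equating the derivative to zero results in the maximisation of the augmented log-likelihood function,

$$ -\left[ \frac{(y^* - \hat{\mu})}{\hat{\sigma}^2 (1 - \mathbf{r}^\top \mathbf{R}^{-1} \mathbf{r})} \right] + \left[ \frac{\mathbf{r}^\top \mathbf{R}^{-1} (\mathbf{y} - \mathbf{1}\hat{\mu})}{\hat{\sigma}^2 (1 - \mathbf{r}^\top \mathbf{R}^{-1} \mathbf{r})} \right] = 0, \tag{2.4.10} $$

and solving for $y^*$ yields

$$ \hat{y}^* = \hat{\mu} + \mathbf{r}^\top \mathbf{R}^{-1} (\mathbf{y} - \mathbf{1}\hat{\mu}), \tag{2.4.11} $$

the kriging estimator. Naturally, (2.4.11) can be used for any arbitrary point $\mathbf{x}^*$[193, 194]. We note that making a prediction requires the evaluation of $\mathbf{R}^{-1}$, which can be a computationally demanding process if $n$ is large.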
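The kriging estimator of (2.4.11) can be sketched directly from the quantities defined above. The example below is a hedged illustration on a toy one-dimensional response; the function names `correlation` and `kriging_predict`, the training data and the hyperparameter values are assumptions, and linear solves are used in place of an explicit inverse purely as a numerical convenience.

```python
import numpy as np

def correlation(A, B, theta, p):
    """Correlation (2.4.1) between every row of A (n, d) and every row of B (m, d)."""
    diff = np.abs(A[:, None, :] - B[None, :, :])
    return np.exp(-np.sum(theta * diff ** p, axis=-1))

def kriging_predict(X, y, theta, p, x_star):
    """Kriging estimator (2.4.11) at a single point x_star of shape (d,)."""
    ones = np.ones(y.size)
    R = correlation(X, X, theta, p)
    r = correlation(X, x_star[None, :], theta, p)[:, 0]
    # Optimal mean from (2.4.4); solves replace the explicit inverse of R.
    mu_hat = (ones @ np.linalg.solve(R, y)) / (ones @ np.linalg.solve(R, ones))
    return mu_hat + r @ np.linalg.solve(R, y - mu_hat)

# Toy one-dimensional training set (illustrative only).
X_train = np.arange(5.0)[:, None]
y_train = np.sin(X_train[:, 0])
theta = np.array([1.0])
p = np.array([2.0])

# Prediction at a training point reproduces the training response.
y_at_training_point = kriging_predict(X_train, y_train, theta, p, X_train[2])
# Prediction at an untrained point between two training points.
y_at_new_point = kriging_predict(X_train, y_train, theta, p, np.array([1.5]))
```

When `x_star` coincides with a training point, `r` is a column of `R`, so the solve returns the corresponding residual and the prediction collapses to the training response. This is the interpolative property noted earlier in this section.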

We conclude by introducing two informative metrics to assess the quality of a kriging model. The first is the condition number[195] of the correlation matrix, $\kappa(\mathbf{R})$, given by

$$ \kappa(\mathbf{R}) = \begin{cases} \|\mathbf{R}\| \, \|\mathbf{R}^{-1}\| & \text{if } |\mathbf{R}| \neq 0 \\ \infty & \text{if } |\mathbf{R}| = 0 \end{cases}, \tag{2.4.12} $$

where $\|\mathbf{R}\|$ corresponds to the operator norm of $\mathbf{R}$. If $\kappa(\mathbf{R}) = 1$, then the approximation to a solution set introduces no larger errors than those present within the input set. An ill-conditioned matrix, i.e. one whose condition number is large, typically leads to numerical instabilities, and is naturally problematic. It has been found that a small spacing between training points leads to an increase in the condition number of $\mathbf{R}$, which results in a deterioration in the predictive capacity of the machine learning model[196]. We address this issue in significant detail in Chapter 5.
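The effect of training point spacing on the condition number can be illustrated numerically. The sketch below compares two one-dimensional training sets that differ only in their spacing; the spacings, the hyperparameter values and the function names are assumptions chosen for illustration, and the 2-norm is taken as the operator norm.

```python
import numpy as np

def correlation_matrix_1d(x, theta=1.0, p=2.0):
    """Correlation matrix of (2.4.1) for one-dimensional training points."""
    diff = np.abs(x[:, None] - x[None, :])
    return np.exp(-theta * diff ** p)

def condition_number(R):
    """2-norm condition number: ratio of largest to smallest singular value."""
    return np.linalg.cond(R, 2)

# Six training points spread over [0, 5] versus the same number packed into [0, 1].
kappa_wide = condition_number(correlation_matrix_1d(np.linspace(0.0, 5.0, 6)))
kappa_tight = condition_number(correlation_matrix_1d(np.linspace(0.0, 1.0, 6)))

# The identity matrix is perfectly conditioned.
kappa_identity = condition_number(np.eye(3))
```

Closely spaced points give nearly identical rows in the correlation matrix, which drives its smallest singular value towards zero and so inflates the condition number.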

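The second useful metric is the mean squared error (MSE) at a prediction point, $s^2(\mathbf{x}^*)$, given by

$$ s^2(\mathbf{x}^*) = \hat{\sigma}^2 \left[ 1 - \mathbf{r}^\top \mathbf{R}^{-1} \mathbf{r} + \frac{(1 - \mathbf{1}^\top \mathbf{R}^{-1} \mathbf{r})^2}{\mathbf{1}^\top \mathbf{R}^{-1} \mathbf{1}} \right]. \tag{2.4.13} $$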

Since kriging is interpolative, $s^2(\mathbf{x}^*)$ is equal to zero if $\mathbf{x}^*$ is a training point. When $\mathbf{x}^*$ is not a training point, the mean squared error reflects the level of uncertainty in the prediction of $y^*$, i.e. the confidence we have in our prediction. The mean squared error is brought down by adding more training points in the vicinity of a prediction point, and so can be used to ascertain whether a greater training density is required in a given region.
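The behaviour of the mean squared error of (2.4.13) at and away from training points can be checked with a short sketch. As before, the function names, the toy data and the hyperparameter values are assumptions made for illustration.

```python
import numpy as np

def correlation(A, B, theta, p):
    """Correlation (2.4.1) between every row of A (n, d) and every row of B (m, d)."""
    diff = np.abs(A[:, None, :] - B[None, :, :])
    return np.exp(-np.sum(theta * diff ** p, axis=-1))

def kriging_mse(X, y, theta, p, x_star):
    """Mean squared error (2.4.13) at a single point x_star of shape (d,)."""
    n = y.size
    ones = np.ones(n)
    R = correlation(X, X, theta, p)
    r = correlation(X, x_star[None, :], theta, p)[:, 0]
    Rinv_r = np.linalg.solve(R, r)
    Rinv_1 = np.linalg.solve(R, ones)
    # Optimal mean and variance from (2.4.4).
    mu_hat = (ones @ np.linalg.solve(R, y)) / (ones @ Rinv_1)
    resid = y - mu_hat
    sigma2_hat = (resid @ np.linalg.solve(R, resid)) / n
    return sigma2_hat * (1.0 - r @ Rinv_r + (1.0 - ones @ Rinv_r) ** 2 / (ones @ Rinv_1))

# Toy one-dimensional training set (illustrative only).
X_train = np.arange(5.0)[:, None]
y_train = np.sin(X_train[:, 0])
theta = np.array([1.0])
p = np.array([2.0])

mse_at_training_point = kriging_mse(X_train, y_train, theta, p, X_train[2])
mse_far_from_training = kriging_mse(X_train, y_train, theta, p, np.array([100.0]))
```

Far from every training point the vector `r` tends to zero, so the error approaches its largest value, which is the signal that additional training points are needed in that region.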

2.4.3 Particle Swarm Optimisation

Particle Swarm Optimisation (PSO) is an optimisation strategy which draws from the dynamics of swarms in nature[197, 198]. Each member of the swarm is aware of its current position and is free to move in any way it pleases, but also appears to be aware of some global swarm properties. This latter knowledge gives the impression that the swarm moves as a single entity, in spite of the members of the swarm being free to move independently of the other members. PSO utilises these swarm dynamics as a means to locate the global optimum of a function; particle positions take the form of the degrees of freedom of the function, sampling the function at these points. The swarm then acts collectively to converge upon the global optimum of the function, by moving towards those positions where the function appears to be optimal.

For the sake of simplicity, we will combine the vectors $\boldsymbol{\theta}$ and $\mathbf{p}$ into a single $2d$-dimensional vector, $\mathbf{z}$, which denotes a point in $\{\boldsymbol{\theta}, \mathbf{p}\}$-space. With PSO, we consider an ensemble of $N_{\mathrm{swarm}}$ particles, each one characterised by a position vector $\mathbf{z}_i$. Each particle is free to move in discrete time, $t$, subject to

$$ \mathbf{z}_i(t + 1) = \mathbf{z}_i(t) + \mathbf{v}_i(t + 1), \tag{2.4.14} $$

where $\mathbf{v}_i(t + 1)$ is the “velocity” of the $i$th particle. The velocity has functional form¹¹

$$ \begin{aligned} \mathbf{v}_i(t + 1) = {} & \omega \mathbf{v}_i(t) \\ & + c_1 \mathbf{U}_1(0, 1) \circ (\mathbf{b}_i(t) - \mathbf{z}_i(t)) \\ & + c_2 \mathbf{U}_2(0, 1) \circ (\mathbf{g}_i(t) - \mathbf{z}_i(t)), \end{aligned} \tag{2.4.15} $$

and we discuss each term individually. $\omega$ is referred to as the “inertia” of the particle (conventionally set to 0.729), and dictates the level to which the velocity at time $t$ contributes to the velocity at time $t + 1$. $c_1$ is the “cognitive learning factor” (set to 1.494 in our work), $\mathbf{U}_1(0, 1)$ is a vector of uniform random variates, and $\mathbf{b}_i(t)$ is the “private guide”, corresponding to the best position the $i$th particle has visited along its trajectory, allowing us to interpret this term as a driving force towards this previous best position. $c_2$ is the “social learning factor” (again set to 1.494), $\mathbf{U}_2(0, 1)$ is a vector of uniform random variates, and $\mathbf{g}_i(t)$ is the “global guide”, corresponding to the best position over all particles in the swarm along all $N_{\mathrm{swarm}}$ trajectories, this term therefore acting as a driving force towards that position.

The dynamics outlined above are iterated for each particle in the swarm. Because of the social learning factor, the particle trajectories are drawn to the same area. The private guide allows for a simultaneous exploration of the particle’s local environment. As time progresses, the particles converge upon the same point. The optimisation is considered complete when all particles occupy the same point, to within some numerical tolerance.
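The update rules (2.4.14) and (2.4.15), together with the convergence criterion just described, can be sketched as follows. This is a minimal illustration on a toy objective with a known maximum rather than on the concentrated log-likelihood; the function names, the swarm size, the iteration cap, the tolerance and the random seed are all assumptions. For brevity the sketch also does not enforce the hyperparameter bounds of (2.4.2) after initialisation.

```python
import numpy as np

def pso_maximise(f, lower, upper, n_swarm=30, n_iter=300,
                 omega=0.729, c1=1.494, c2=1.494, tol=1e-8, seed=0):
    """Maximise f over a box using the conventional PSO update rules."""
    rng = np.random.default_rng(seed)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    z = rng.uniform(lower, upper, size=(n_swarm, lower.size))  # positions
    v = np.zeros_like(z)                                       # velocities
    b = z.copy()                                               # private guides
    b_val = np.array([f(zi) for zi in z])
    g = b[np.argmax(b_val)].copy()                             # global guide
    for _ in range(n_iter):
        u1 = rng.uniform(size=z.shape)
        u2 = rng.uniform(size=z.shape)
        # Element-wise products implement the Hadamard products of (2.4.15).
        v = omega * v + c1 * u1 * (b - z) + c2 * u2 * (g - z)
        z = z + v                                              # (2.4.14)
        val = np.array([f(zi) for zi in z])
        improved = val > b_val
        b[improved] = z[improved]
        b_val[improved] = val[improved]
        g = b[np.argmax(b_val)].copy()
        # Stop once every particle sits on the global guide, within tolerance.
        if np.max(np.linalg.norm(z - g, axis=1)) < tol:
            break
    return g, b_val.max()

def toy_objective(z):
    """Concave test function with its maximum of zero at z = (1, 1)."""
    return -np.sum((z - 1.0) ** 2)

z_best, f_best = pso_maximise(toy_objective, lower=[-5.0, -5.0], upper=[5.0, 5.0])
```

In the context of this section, `f` would be replaced by the concentrated log-likelihood of (2.4.5) evaluated at the hyperparameters packed into $\mathbf{z}$.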

A number of modifications to the PSO algorithm exist: adaptive PSO[199], the swarm and queen algorithm[200], hybrid PSO[201] (where particle multiplication is permitted), etc. However, we have found the conventional PSO algorithm to sufficiently optimise $\boldsymbol{\theta}$ and $\mathbf{p}$. This has been verified within our group by comparison to a number of alternative optimisation strategies, such as differential evolution[202] and analytical optimisation[203].

¹¹ In this equation, we use $\mathbf{x} \circ \mathbf{y}$ to denote the Hadamard product between two vectors, which yields the vector $\mathbf{x} \circ \mathbf{y} = [x_1 y_1 \; x_2 y_2 \; \dots \; x_{2d} y_{2d}]^\top$.

2.5 The Quantum Chemical Topological Force Field

2.5.1 Previous Work

The first work towards the construction of QCTFF verified that the multipole moments, as computed through a QCT partitioning, reproduced the electrostatics of a system over the course of an MD simulation. Studies on both hydrogen fluoride[204] and water[205] clusters were undertaken to this end. With the water clusters, it was found that the density and potential energy were roughly 0.1% and 3%, respectively, off their experimental values, when multipole moment electrostatics were used. The spatial distribution functions were also found to be in good agreement with experiment, with correlation coefficients above 0.97.

In constructing a conformationally-dependent force field, our group began by using ANNs. The first work on this matter was that of Houlding et al.[176], where multipole moments, up to the hexadecapole moments, were predicted for the hydrogen fluoride dimer at a number of MD snapshot geometries. This work was followed by that of Handley et al.[175], where the same approach was taken for water clusters of increasing sizes (up to the hexamer). In the final such study, Darley et al.[206] applied this methodology to N-methyl acetamide and glycine, demonstrating that the approach was valid for biomolecular systems.

For reasons detailed in Section 2.4, the group subsequently transitioned from the use of ANNs to kriging, demonstrating that the errors associated with kriging were lower relative to those of ANNs for water[207], ethanol[208] and alanine[120]. Hydrated sodium ions[209] have also been treated in the same manner. The electrostatic energy of the system over a number of conformations was predicted to an accuracy of 4 kJ mol$^{-1}$.

The first MD simulations carried out using the kriging models studied aqueous imidazole[210]. The solution density, diffusion coefficients, radial and spatial distribution functions were used to evaluate performance relative to AMBER. The solution density was well-recovered by the QCTFF implementation up to high concentrations, in contrast to AMBER, where the solution density was consistently underestimated relative to experiment. A similar result was found with the diffusion coefficients. The spatial and radial distributions also suggested a completely different ratio of imidazole stacked to chain conformers relative to the AMBER implementation. This work was later built upon in a study on liquid water over a range of pressures and temperatures[211]. Bulk properties were found to be well-recovered relative to point charge water potentials.

More recently, work has focused on reducing the errors attributable to our machine learning models by the refinement of in-house software[121]. As the systems studied become more complex, the workflow associated with the generation of kriging models requires streamlining so that the resultant kriging models are as accurate and as small (vis-à-vis memory requirements) as possible[212, 213]. In addition, work has also shifted from the generation of kriging models for electrostatics to kriging models for the IQA components outlined in Section 2.2.3[214, 215]. In this way, we are able to forsake the empirical potentials employed in classical force fields for a fully quantum mechanical treatment of molecular energetics.

Work has also been undertaken on a number of conceptual issues that require resolution before the QCTFF becomes a commercially viable product[216]. For instance, the construction of a force field requires some concept of atom typing. In defining an atom type, one is able to use the same parameter set (or kriging models in our case) across a range of atoms within a system. In so doing, one is able to circumvent parameterising new models for every atom within a system, which is simply not feasible. The idea of an atom type is inextricably linked to that of transferability, which we have discussed in Section 2.3.2. A number of studies have been undertaken within our group to this end in an effort to determine the atom types required for the QCTFF to be generally applicable[217, 218, 219].

2.5.2 Proposed Implementation

Our aim is to construct a conformationally-dependent force field whose energetic terms derive from an IQA treatment of a molecular system. The conformational dependence is accounted for by the construction of a kriging model that can predict an IQA energy component for untrained molecular conformations. This project has been named the “Quantum Chemical Topological Force Field” in the literature.

Through previous work in our group, we have found that constructing individual kriging models for each of the components outlined in (2.2.14) yields undesirably large prediction errors. Instead, we have found that the use of the four following terms is ideal,

$$E^{A}_{\mathrm{IQA}} = E^{A}_{\mathrm{self}} + \sum_{B} V^{AB}_{\mathrm{xc}} + \sum_{B_{1-4}} V^{AB}_{\mathrm{SRC}} + \sum_{B_{1-5}} V^{AB}_{\mathrm{LRC}} . \tag{2.5.1}$$

$E^{A}_{\mathrm{self}}$ is the “self-energy” of atom $A$, comprising the kinetic, electron-nuclear attraction and electron-electron repulsion energies,

$$\begin{aligned}
E^{A}_{\mathrm{self}} &= T^{A} + V^{AA}_{\mathrm{en}} + V^{AA}_{\mathrm{ee}} \\
&= T^{A} + V^{AA}_{\mathrm{en}} + V^{AA}_{\mathrm{ee,coul}} + V^{AA}_{\mathrm{ee,exch}} + V^{AA}_{\mathrm{ee,corr}} .
\end{aligned} \tag{2.5.2}$$

$V^{AB}_{\mathrm{xc}}$ is the inter-atomic exchange-correlation energy, and is given by

$$V^{AB}_{\mathrm{xc}} = V^{AB}_{\mathrm{ee,exch}} + V^{AB}_{\mathrm{ee,corr}} . \tag{2.5.3}$$

The terms $V^{AB}_{\mathrm{SRC}}$ and $V^{AB}_{\mathrm{LRC}}$ correspond to the short- and long-ranged classical energies, respectively. The long-ranged classical energy takes the form

$$V^{AB}_{\mathrm{LRC}} = V^{AB}_{\mathrm{ee,coul}} + V^{AB}_{\mathrm{en}} + V^{AB}_{\mathrm{ne}} + V^{AB}_{\mathrm{nn}} \qquad B \in \text{1-5 and higher} , \tag{2.5.4}$$

which can be expanded in multipole moments. As such, we require the constraint that the interaction between $A$ and $B$ be 1-5 and higher, otherwise the multipole moment interaction will diverge, as discussed in Section 2.3.1. We are then required to express $V^{AB}_{\mathrm{SRC}}$ explicitly as

$$V^{AB}_{\mathrm{SRC}} = V^{AB}_{\mathrm{ee,coul}} + V^{AB}_{\mathrm{en}} + V^{AB}_{\mathrm{ne}} + V^{AB}_{\mathrm{nn}} \qquad B \in \text{1-4 and lower} . \tag{2.5.5}$$

Through all of these terms, we find that we need three separate kriging models for $E^{A}_{\mathrm{self}}$, $\sum_{B} V^{AB}_{\mathrm{xc}}$ and $\sum_{B_{1-4}} V^{AB}_{\mathrm{SRC}}$, and a further twenty five separate kriging models for the multipole moments (up to the hexadecapole moment) of $\sum_{B_{1-5}} V^{AB}_{\mathrm{LRC}}$. Therefore, we require twenty eight individual kriging models for every atom in the system.

Chapter 3

Conformational Sampling

3.1 Introduction

Our aim is the construction of a kriging model that can be used over the course of a MD trajectory to make reliable predictions. Accordingly, we require training points that span a domain of conformational space which is available to the system under the simulation conditions. Concurrently, some regions of the aforementioned conformational domain are difficult to model. These regions require a greater sampling density to capture the undulant features of the PES. We are therefore required to develop a conformational sampling methodology that satisfies this twofold precondition.

The conformational sampling of small molecular species has historically been undertaken by some form of systematic variation of a small number of molecular degrees of freedom, for instance, torsional angles[220, 221, 222]. These systematic techniques were later adapted to vary a small number of degrees of freedom, whether these be internal[223, 224, 225] or Cartesian[226, 227] coordinates, stochastically, typically by invoking some form of Metropolis Monte Carlo scheme[228]. However, for any molecular species containing more than 12 rotatable bonds, sampling a subset of the molecular degrees of freedom, systematically or stochastically, is not a feasible strategy for the generation of a physically realistic conformational ensemble.

Another strategy involves invoking some form of classical MD[229, 230], and selecting relevant conformers based on one of the following:

1. The lowest energy conformers as determined by some classical force field;

2. The output of a conformer at regular intervals along a MD trajectory;

3. Maximising the conformational diversity of the conformers by analysis of some group of degrees of freedom.

A number of statistical tools can be used to demonstrate the inadequacy of MD as a means for conformational sampling. Essential dynamics[231], a form of principal component analysis, is one such prevalent tool. Past work has shown that the conformational flexibility of a protein is attributable to a small number of low frequency, delocalised modes of vibration[232]. Indeed, there is a large literature on using frequency analysis to decompose a MD trajectory into a small number of vibrational degrees of freedom, see for example [233, 234, 235, 236]. However, the finite time scale of a MD simulation does not allow for a proper sampling of these modes, as verified from an essential dynamics trajectory analysis[237].

Large scale conformational changes occur over relatively long timescales. A complete exploration of conformational space with MD, therefore, takes a significant amount of time and computational resources. A number of developments have, however, made large scale conformational sampling feasible. The first of these is the RESPA (Reversible rEference System Propagator Algorithm), an integration scheme which follows from a Liouville operator treatment of Newtonian dynamics[238]. The RESPA allows for a number of dynamical timesteps to be used over the course of a MD trajectory; the slower degrees of freedom are integrated with longer timesteps than their higher frequency counterparts. RESPA thereby reduces the computational cost of the problem, and consequently allows for longer MD trajectories to be computed. It has even been shown that certain biomolecular systems are amenable to outer timesteps of the order of 100 fs[239].
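The multiple-timestep idea can be illustrated with a minimal sketch. The one-dimensional particle, the split of the force into a stiff “fast” spring and a soft “slow” spring, and all function names and parameter values below are assumptions chosen purely for illustration; they are not taken from [238] or from any production MD code.

```python
def fast_force(x, k_fast=100.0):
    """Stiff (high-frequency) harmonic force; illustrative only."""
    return -k_fast * x


def slow_force(x, k_slow=1.0):
    """Soft (low-frequency) harmonic force; illustrative only."""
    return -k_slow * x


def respa_step(x, v, mass, dt_outer, n_inner):
    """One reversible multiple-timestep step.

    The slow force is applied as two half-kicks with the outer timestep,
    while the fast force is integrated by velocity Verlet with the
    inner timestep dt_outer / n_inner.
    """
    dt_inner = dt_outer / n_inner
    v += 0.5 * dt_outer * slow_force(x) / mass      # slow half-kick
    for _ in range(n_inner):                        # fast inner loop
        v += 0.5 * dt_inner * fast_force(x) / mass
        x += dt_inner * v
        v += 0.5 * dt_inner * fast_force(x) / mass
    v += 0.5 * dt_outer * slow_force(x) / mass      # slow half-kick
    return x, v


def total_energy(x, v, mass, k_fast=100.0, k_slow=1.0):
    """Kinetic plus potential energy of the two-spring model."""
    return 0.5 * mass * v * v + 0.5 * (k_fast + k_slow) * x * x


x, v, mass = 1.0, 0.0, 1.0
e_start = total_energy(x, v, mass)
for _ in range(1000):
    x, v = respa_step(x, v, mass, dt_outer=0.05, n_inner=10)
e_end = total_energy(x, v, mass)
relative_drift = abs(e_end - e_start) / e_start
print(f"relative energy drift: {relative_drift:.2e}")
```

In this sketch the slow force is evaluated only twice per outer step, whereas the fast force is evaluated at every inner step, which is where the saving arises when the slow force is the expensive one.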

If the PES features a number of low energy regions separated from one another by high energy transition barriers, then even the RESPA is rarely adequate for a complete sampling of conformational space. The timescales required for a trajectory to surmount these high energy transition barriers are typically of the order of milliseconds, and transitions are referred to as “rare events”[240]. Metadynamics[241] and high temperature MD[242] have been used to accelerate the transitioning to higher energy regions of the PES, thereby accelerating conformational sampling. However, results have shown these techniques to be susceptible to error owing to the somewhat arbitrary choice of enhancement parameters, such as the choice of the high temperature for transition[243]. It has also been shown that the conformational regions explored using high temperature MD do not overlap with some systematic conformational search methodologies[244].

An interesting methodology, which is somewhat similar to the conformational sampling strategy that we will outline in this chapter, has proposed a “mode-following” protocol[245, 246]. With this, an arbitrary initial molecular conformation is discretely evolved along one or a subset of its lowest frequency normal modes. Note that the molecular Hessian is not updated along the trajectory. Periodically, the system is minimised by use of a classical force field and analysed.

It is not the sampling strategies that we are averse to. Indeed, there exist elegant algorithms, such as the Wang-Landau algorithm[247] or the nested sampling algorithm[248] for conformational sampling[249, 250, 251]. However, each method we have outlined utilises a classical force field or some other empirical potential. It would be unsightly to initiate the construction of our force field with the very methodology that we are trying to supersede. Aside from this conceptual concern, conventional force fields are notorious for significant disagreement in virtually every molecular property[252, 151]. The choice of force field would then impact on the training of our own force field. Typifying this criticism, a recent article has shown the disparity between the popular classical force fields in predicting the conformational preferences of a number of small proteins[253].

In the following, we present a method with which rigorous conformational sampling can be undertaken for the purpose of constructing kriging training sets. We approximate the ab initio PES by tessellating with a set of second order Taylor expansions at a number of distinct seeding geometries. We refer to these second order Taylor expansions as “Local Wells” (LWs) since they are quadratic in the molecular energy. Stochastically hopping between these LWs, followed by vibration within the LW, allows for dynamics to be undertaken on this approximate “tessellated” PES.

3.2 Stationary Point Vibrations

3.2.1 Kinetic Energy

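For a molecular system with $N$ atoms, we define a molecular Cartesian state vector, $\mathbf{x} = \begin{bmatrix} x_1 & x_2 & \dots & x_{3N} \end{bmatrix}^{\top}$, and introduce a difference coordinate, $\Delta\mathbf{x}$, relative to some arbitrary configuration, $\mathbf{x}^{*} = \begin{bmatrix} x_1^{*} & x_2^{*} & \dots & x_{3N}^{*} \end{bmatrix}^{\top}$, such that

$$\begin{aligned}
\Delta\mathbf{x} = \mathbf{x} - \mathbf{x}^{*} &= \begin{bmatrix} x_1 - x_1^{*} & x_2 - x_2^{*} & \dots & x_{3N} - x_{3N}^{*} \end{bmatrix}^{\top} \\
&= \begin{bmatrix} \Delta x_1 & \Delta x_2 & \dots & \Delta x_{3N} \end{bmatrix}^{\top} .
\end{aligned} \tag{3.2.1}$$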

In the following, we will use Greek indices to enumerate atoms, and Latin indices to enumerate degrees of freedom. Constraining x∗ to be invariant, we can use ∆x as a conventional state vector in the following derivation. The classical expression for the kinetic energy of a system is then given by

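$$\begin{aligned}
T(\Delta\dot{x}_1, \dots, \Delta\dot{x}_{3N}) &= \sum_{i=1}^{3N} \frac{m_i}{2} \left( \frac{d}{dt} \Delta x_i \right)^{2} \\
&= \sum_{\alpha=1}^{N} \frac{m_\alpha}{2} \left( \Delta\dot{\mathbf{x}}_\alpha \cdot \Delta\dot{\mathbf{x}}_\alpha \right) ,
\end{aligned} \tag{3.2.2}$$

where we have given the formulation both in terms of individual degrees of freedom and in terms of atomic position vectors, $\Delta\mathbf{x}_\alpha$. It will be convenient to introduce a set of mass-weighted Cartesian coordinates, $q_i = \Delta x_i \sqrt{m_i}$, so that (3.2.2) becomes

$$T(\dot{q}_1, \dots, \dot{q}_{3N}) = \frac{1}{2} \sum_{i=1}^{3N} \left( \frac{d}{dt} q_i \right)^{2} = \frac{1}{2} \sum_{i=1}^{3N} \dot{q}_i^{2} . \tag{3.2.3}$$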
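The equivalence of (3.2.2) and (3.2.3) can be checked numerically. The following is a minimal sketch; the masses and velocities are arbitrary illustrative numbers in arbitrary units, and the function names are our own.

```python
import math


def kinetic_energy_cartesian(masses, velocities):
    """T = sum_i (m_i / 2) * (d(dx_i)/dt)^2, as in (3.2.2)."""
    return sum(0.5 * m * v * v for m, v in zip(masses, velocities))


def kinetic_energy_mass_weighted(q_dot):
    """T = (1/2) * sum_i q_dot_i^2, as in (3.2.3)."""
    return 0.5 * sum(qd * qd for qd in q_dot)


# One mass per degree of freedom: each atomic mass is repeated three times.
masses = [15.999] * 3 + [1.008] * 3
velocities = [0.1, -0.2, 0.05, 1.0, 0.3, -0.7]

# q_i = dx_i * sqrt(m_i) implies q_dot_i = v_i * sqrt(m_i).
q_dot = [v * math.sqrt(m) for m, v in zip(masses, velocities)]

t_cartesian = kinetic_energy_cartesian(masses, velocities)
t_mass_weighted = kinetic_energy_mass_weighted(q_dot)
print(t_cartesian, t_mass_weighted)
```

The masses are absorbed into the coordinates, which is what allows the kinetic energy to take the mass-free form of (3.2.3).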

3.2.2 Potential Energy

The potential energy corresponding to a state, $V(\mathbf{x}) = V(x_1, \dots, x_{3N})$, is given by a Taylor series about the previously introduced state $\mathbf{x}^{*}$, leading to

$$\begin{aligned}
2V(x_1, \dots, x_{3N}) = {} & 2V_0(x_1^{*}, \dots, x_{3N}^{*}) + 2 \sum_{i=1}^{3N} (x_i - x_i^{*}) \left. \frac{\partial V}{\partial x_i} \right|_{\mathbf{x}^{*}} \\
& + \sum_{i,j=1}^{3N} (x_i - x_i^{*})(x_j - x_j^{*}) \left. \frac{\partial^{2} V}{\partial x_i \partial x_j} \right|_{\mathbf{x}^{*}} + \dots .
\end{aligned} \tag{3.2.4}$$

The derivative factors are constant owing to the evaluation condition. Introducing some shorthand notation for the derivatives¹,

$$\left. \frac{\partial V}{\partial x_i} \right|_{\mathbf{x}^{*}} = J_i , \qquad \left. \frac{\partial^{2} V}{\partial x_i \partial x_j} \right|_{\mathbf{x}^{*}} = H_{ij} , \tag{3.2.5}$$

such that

$$\begin{aligned}
2V(x_1, \dots, x_{3N}) = {} & 2V_0(x_1^{*}, \dots, x_{3N}^{*}) + 2 \sum_{i=1}^{3N} (x_i - x_i^{*}) J_i \\
& + \sum_{i,j=1}^{3N} (x_i - x_i^{*})(x_j - x_j^{*}) H_{ij} + \dots .
\end{aligned} \tag{3.2.6}$$

The first order and second order spatial derivatives of the potential energy correspond to elements of the Jacobian and Hessian, respectively. By choosing $\mathbf{x}^{*}$ such that it represents a stationary point on the potential energy surface, we are free to set $V(\mathbf{x}^{*}) = 0$. The term involving the Jacobian goes to zero at this point, by definition. Omitting all terms higher than second order, we obtain a harmonic approximation to the potential energy

$$2V(x_1, \dots, x_{3N}) \approx \sum_{i,j=1}^{3N} (x_i - x_i^{*})(x_j - x_j^{*}) H_{ij} . \tag{3.2.7}$$

¹ The Jacobian, $\mathbf{J}$, is defined as the matrix of all first-order partial derivatives of a function, $f : \mathbb{R}^{a} \to \mathbb{R}^{b}$, with respect to those degrees of freedom over which $f$ is defined. Taking the case of $b = 1$, we see that $\mathbf{J}$ takes the form of $\begin{bmatrix} \partial f/\partial x_1 & \dots & \partial f/\partial x_a \end{bmatrix}^{\top}$, which is the form used here. In this case, the Jacobian is equivalent to the gradient of a scalar field, $\nabla f$.

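It is useful to express the potential energy in the same coordinates as those used for the kinetic energy. This can be achieved by using the definition of our mass-weighted Cartesian coordinates, leading to the relation

$$\frac{\partial}{\partial q_i} = \frac{1}{\sqrt{m_i}} \frac{\partial}{\partial (x_i - x_i^{*})} = \frac{1}{\sqrt{m_i}} \frac{\partial}{\partial x_i} .$$

Modifying (3.2.7) to account for the above transformation, we obtain

$$\begin{aligned}
2V(x_1, \dots, x_{3N}) &= \sum_{i,j=1}^{3N} \Delta x_i \Delta x_j \left. \frac{\partial^{2} V}{\partial x_i \partial x_j} \right|_{\mathbf{x}^{*}} = \sum_{i,j=1}^{3N} \frac{q_i q_j}{\sqrt{m_i m_j}} \sqrt{m_i m_j} \left. \frac{\partial^{2} V}{\partial q_i \partial q_j} \right|_{\mathbf{q}^{*}} \\
&= \sum_{i,j=1}^{3N} H_{ij} q_i q_j = 2V(q_1, \dots, q_{3N}) ,
\end{aligned} \tag{3.2.8}$$

where the Hessian is now expressed in mass-weighted Cartesian form, i.e.

$$H_{ij} = \frac{1}{\sqrt{m_i m_j}} \left. \frac{\partial^{2} V}{\partial x_i \partial x_j} \right|_{\mathbf{x}^{*}} = \left. \frac{\partial^{2} V}{\partial q_i \partial q_j} \right|_{\mathbf{q}^{*}} .$$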
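As a numerical illustration of (3.2.8), the sketch below mass-weights a small Cartesian Hessian and confirms that the harmonic energy is unchanged by the change of coordinates. The 2 × 2 Hessian, the masses and the displacements are arbitrary illustrative values, and the function names are our own.

```python
import math


def mass_weight_hessian(hessian, masses):
    """Return H_ij / sqrt(m_i * m_j) for a Cartesian Hessian H."""
    n = len(masses)
    return [
        [hessian[i][j] / math.sqrt(masses[i] * masses[j]) for j in range(n)]
        for i in range(n)
    ]


def harmonic_energy(hessian, coords):
    """V = (1/2) * sum_ij H_ij c_i c_j, the harmonic form of (3.2.7)."""
    n = len(coords)
    return 0.5 * sum(
        hessian[i][j] * coords[i] * coords[j] for i in range(n) for j in range(n)
    )


cartesian_hessian = [[2.0, -1.0], [-1.0, 3.0]]
masses = [1.0, 4.0]
dx = [0.1, -0.2]

# q_i = dx_i * sqrt(m_i)
q = [d * math.sqrt(m) for d, m in zip(dx, masses)]
mw_hessian = mass_weight_hessian(cartesian_hessian, masses)

v_cartesian = harmonic_energy(cartesian_hessian, dx)
v_mass_weighted = harmonic_energy(mw_hessian, q)
print(v_cartesian, v_mass_weighted)
```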

3.2.3 Equations of Motion

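Given our expressions for the kinetic and potential energy, (3.2.2) and (3.2.8), respectively, we can evaluate the Euler-Lagrange equations of motion

$$\frac{d}{dt} \frac{\partial T}{\partial \dot{q}_k} + \frac{\partial V}{\partial q_k} = 0 \qquad \forall k = 1, \dots, 3N , \tag{3.2.9}$$

leading to

$$\begin{aligned}
& \frac{d}{dt} \frac{\partial}{\partial \dot{q}_k} \left( \frac{1}{2} \sum_{i=1}^{3N} \dot{q}_i^{2} \right) + \frac{\partial}{\partial q_k} \left( \frac{1}{2} \sum_{i,j=1}^{3N} H_{ij} q_i q_j \right) \\
&= \frac{d}{dt} \left( \frac{1}{2} \sum_{i=1}^{3N} \frac{\partial}{\partial \dot{q}_k} \dot{q}_i^{2} \right) + \frac{1}{2} \sum_{i,j=1}^{3N} \frac{\partial}{\partial q_k} H_{ij} q_i q_j \\
&= \frac{d}{dt} \left( \sum_{i=1}^{3N} \delta_{ik} \dot{q}_i \right) + \frac{1}{2} \sum_{i,j=1}^{3N} H_{ij} \left( q_i \frac{\partial q_j}{\partial q_k} + q_j \frac{\partial q_i}{\partial q_k} \right) \\
&= \frac{d}{dt} \dot{q}_k + \frac{1}{2} \sum_{i,j=1}^{3N} H_{ij} \left( q_i \delta_{jk} + q_j \delta_{ik} \right) \\
&= \frac{d^{2}}{dt^{2}} q_k + \frac{1}{2} \left( \sum_{i=1}^{3N} H_{ik} q_i + \sum_{j=1}^{3N} H_{kj} q_j \right) \\
&= \frac{d^{2}}{dt^{2}} q_k + \sum_{i=1}^{3N} H_{ik} q_i = 0 ,
\end{aligned} \tag{3.2.10}$$

where we have invoked the symmetric nature of the Hessian ($H_{ij} = H_{ji}$) and the fact that the last two sums are identical since $i, j$ are dummy indices.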

We are now required to solve a second order homogeneous differential equation, the solution of which is a simple superposition of sinusoids of angular frequency $\omega$ and amplitudes $A_k, B_k$ for the $k$th equation of motion,

$$q_k(t) = A_k \cos(\omega t) + B_k \sin(\omega t) .$$

We choose to use the more compact notation of a single sinusoid with a phase factor, $\phi$,

$$q_k(t) = A_k \cos(\omega t + \phi) . \tag{3.2.11}$$

Substituting (3.2.11) into (3.2.10), we obtain

$$\begin{aligned}
\frac{d^{2}}{dt^{2}} A_k \cos(\omega t + \phi) + \sum_{i=1}^{3N} H_{ik} A_i \cos(\omega t + \phi) &= 0 \\
-\omega^{2} A_k \cos(\omega t + \phi) + \sum_{i=1}^{3N} H_{ik} A_i \cos(\omega t + \phi) &= 0 .
\end{aligned} \tag{3.2.12}$$

The next step involves the cancellation of the factor $\cos(\omega t + \phi)$ in each term. This cancellation places a constraint on the form of (3.2.11): if $\omega t + \phi = (2n + 1)\pi/2$, where $n \in \mathbb{N}$, cancellation is not permitted since the cosine of this quantity is zero. However, prior to cancellation, we recover $q_k(t) = 0$ from (3.2.12), and so the constraint is irrelevant. Continuing with non-zero $\cos(\omega t + \phi)$, we obtain

$$-\omega^{2} A_k + \sum_{i=1}^{3N} H_{ik} A_i = 0 \quad \therefore \quad \sum_{i=1}^{3N} A_i \left( H_{ik} - \omega^{2} \delta_{ik} \right) = 0 . \tag{3.2.13}$$

This equation constitutes an eigensystem, for which there exist $3N$ values of $\omega$ yielding non-trivial solutions for the $q_k(t)$, i.e. where $A_k \neq 0$. The conventional way to solve the eigensystem is by formation of the secular determinant,

$$\begin{vmatrix}
H_{11} - \omega^{2} & \cdots & H_{1,3N} \\
\vdots & \ddots & \vdots \\
H_{3N,1} & \cdots & H_{3N,3N} - \omega^{2}
\end{vmatrix} = 0 . \tag{3.2.14}$$

Note that if the Hessian is in diagonal form, the determinant takes the form $\prod_{i=1}^{3N} (H_{ii} - \omega^{2})$. Equality to zero follows from at least one factor in this product being equal to zero, which is satisfied only if $\omega^{2}$ is equal to one of the diagonal elements of the Hessian. Therefore, if the Hessian is in diagonal form, the angular frequencies satisfying (3.2.13) are given by the square roots of the diagonal elements of the Hessian.
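A small worked case may help to fix ideas. The sketch below solves the secular determinant in closed form for a hypothetical one-dimensional diatomic, i.e. two masses joined by a single spring; this toy model, its force constant and masses (in arbitrary units) and the function name are assumptions made for illustration only.

```python
import math


def symmetric_2x2_eigenvalues(a, b, c):
    """Roots w2 of the secular determinant |[[a - w2, b], [b, c - w2]]| = 0."""
    mean = 0.5 * (a + c)
    radius = math.hypot(0.5 * (a - c), b)
    return mean - radius, mean + radius


k = 500.0            # spring force constant
m1, m2 = 1.0, 19.0   # the two masses

# Cartesian Hessian [[k, -k], [-k, k]] in mass-weighted form.
a = k / m1
b = -k / math.sqrt(m1 * m2)
c = k / m2

w2_low, w2_high = symmetric_2x2_eigenvalues(a, b, c)
reduced_mass = m1 * m2 / (m1 + m2)
omega_vibration = math.sqrt(w2_high)
print(w2_low, w2_high, k / reduced_mass)
```

One root vanishes (to within round-off), corresponding to the rigid translation of the pair, while the other equals the force constant divided by the familiar reduced mass of the two bodies.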

It is worth pointing out that if one of the diagonal elements of the diagonalised Hessian is less than zero, then the corresponding $\omega$ is imaginary. The resultant equation of motion, (3.2.11), then becomes

$$\begin{aligned}
q_k(t) &= A_k \cos(i\omega t + \phi) \\
&= A_k \cos(i\omega t) \cos(\phi) - A_k \sin(i\omega t) \sin(\phi) \\
&= A_k \cosh(\omega t) \cos(\phi) - i A_k \sinh(\omega t) \sin(\phi) ,
\end{aligned} \tag{3.2.15}$$

where we have used $\cos(ix) = \cosh(x)$ and $\sin(ix) = i \sinh(x)$. This expression diverges after a certain amount of time. However, our timestep selection outlined in Section 3.4.2 guarantees that the equation of motion is never evolved for long enough to diverge. Imaginary frequencies are encountered at non-stationary point conformations on the PES. When we generalise our method to accommodate non-stationary point conformations, this point is worth bearing in mind.
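The contrast between a real and an imaginary frequency can be seen numerically. In the sketch below, the unit amplitude, the frequency magnitude and the choice of zero phase (made so that the displacement stays real) are illustrative assumptions, as is the function name.

```python
import cmath
import math


def displacement(amplitude, omega, t, phase):
    """q(t) = A cos(omega * t + phase); omega may be real or imaginary."""
    return amplitude * cmath.cos(omega * t + phase)


omega = 2.0
times = (0.0, 0.5, 1.0, 2.0, 4.0)

# Real frequency: the displacement stays bounded by the amplitude.
bounded = [displacement(1.0, omega, t, 0.0).real for t in times]

# Imaginary frequency: the cosine becomes a hyperbolic cosine and grows.
growing = [displacement(1.0, 1j * omega, t, 0.0).real for t in times]

for t, qb, qg in zip(times, bounded, growing):
    print(f"t = {t:3.1f}   real omega: {qb:+.3f}   imaginary omega: {qg:.3f}")
```

For short times the two trajectories are similar, which is consistent with restricting the evolution to a sufficiently small timestep.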

Consider now a molecule under the influence of no external fields or field derivatives. If we rigidly shift the molecule in space, we find that the internal state of the molecule is not altered. There are then three basis vectors with no associated equations of motion. This behaviour is manifest in three of the angular frequencies being equal to zero, i.e. by use of (3.2.11), $q_k(t) = 0$ for these three degrees of freedom. A further three degrees of freedom can be argued to be similarly time-invariant. Consider rigid rotations about the three Euler angles of the system. As with the global translational degrees of freedom, rigid rotations do not affect the internal state of the molecule, and so an appropriate coordinate system yields another three angular frequencies equal to zero.

The coordinate system we have alluded to, which renders six angular frequencies equal to zero, is referred to as the normal coordinates. The most computationally amenable way to transform the Hessian into a normal coordinate basis is by constructing an appropriate transformation matrix, $\mathbf{D} : \mathbb{R}^{3N} \to \mathbb{R}^{3N}$. Note that the mapping to normal coordinates conserves the dimensionality of the space, an orthogonal direct sum of external and internal vector spaces, i.e. $\mathbb{R}^{3N} = \mathbb{R}^{6}_{\mathrm{ext}} \oplus \mathbb{R}^{3N-6}_{\mathrm{int}}$. It is the elements of $\mathbb{R}^{3N-6}_{\mathrm{int}}$ that represent the internal degrees of freedom of the system. The elements of $\mathbb{R}^{6}_{\mathrm{ext}}$ correspond to global translational and rotational degrees of freedom, and there is no point in sampling within this space since all conformations are equivalent.

3.3 Generalised Vibrations

It is clear from the previous section that setting the spatial first derivative of the potential to zero simplifies the resultant Euler-Lagrange equations of motion. We now assume that the spatial first derivative does not vanish, and proceed with a general derivation for the equations of motion of a system that is not situated at a stationary point on the potential energy surface.

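Rewriting (3.2.7) to include the Jacobian term,

$$2V(x_1, \dots, x_{3N}) \approx 2 \sum_{i=1}^{3N} (x_i - x_i^{*}) J_i + \sum_{i,j=1}^{3N} (x_i - x_i^{*})(x_j - x_j^{*}) H_{ij} , \tag{3.3.1}$$

where $\mathbf{x}^{*}$ now corresponds to an arbitrary position vector. Transformation into the mass-weighted Cartesian form is accomplished as before by a simple modification

$$\begin{aligned}
2V(q_1, \dots, q_{3N}) &\approx 2 \sum_{i=1}^{3N} \frac{J_i}{\sqrt{m_i}} q_i + \sum_{i,j=1}^{3N} \frac{q_i q_j}{\sqrt{m_i m_j}} H_{ij} \\
&\approx 2 \sum_{i=1}^{3N} J_i q_i + \sum_{i,j=1}^{3N} q_i q_j H_{ij} ,
\end{aligned} \tag{3.3.2}$$

where, in the second line, the Jacobian and Hessian have been re-expressed in terms of the mass-weighted Cartesians,

$$J_i = \frac{1}{\sqrt{m_i}} \left. \frac{\partial V}{\partial x_i} \right|_{\mathbf{x}^{*}} = \left. \frac{\partial V}{\partial q_i} \right|_{\mathbf{q}^{*}} \tag{3.3.3}$$

$$H_{ij} = \frac{1}{\sqrt{m_i m_j}} \left. \frac{\partial^{2} V}{\partial x_i \partial x_j} \right|_{\mathbf{x}^{*}} = \left. \frac{\partial^{2} V}{\partial q_i \partial q_j} \right|_{\mathbf{q}^{*}} . \tag{3.3.4}$$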

From (3.2.9), the only additional calculation we require is the spatial derivative of the term involving the Jacobian in (3.3.2). As such,

$$\begin{aligned}
\frac{\partial}{\partial q_k} \sum_{i=1}^{3N} J_i q_i &= \sum_{i=1}^{3N} J_i \frac{\partial q_i}{\partial q_k} + \sum_{i=1}^{3N} q_i \frac{\partial J_i}{\partial q_k} \\
&= \sum_{i=1}^{3N} J_i \delta_{ik} \\
&= J_k ,
\end{aligned} \tag{3.3.5}$$

where we have introduced the Kronecker delta, $\delta_{ik}$, such that $\delta_{ik} = 1$ if $i = k$, and $\delta_{ik} = 0$ otherwise. The term involving the partial derivative of the Jacobian is equal to zero owing to the evaluation constraint we have placed on the Jacobian. The partial derivative of the potential with respect to a spatial degree of freedom is then

$$\frac{\partial V}{\partial q_k} = J_k + \sum_{i=1}^{3N} H_{ik} q_i . \tag{3.3.6}$$

The expanded Euler-Lagrange equation of (3.2.10) now reads

$$\frac{d^{2}}{dt^{2}} q_k + \sum_{i=1}^{3N} H_{ik} q_i = -J_k , \tag{3.3.7}$$

which is an inhomogeneous second order differential equation. Physical insight into this equation can be garnered by observing that $-J_k$ is a force, and so (3.3.7) is the equation of motion governing a forced harmonic oscillator, where the driving force is a constant.

The solution to an inhomogeneous differential equation is found by solving the underlying homogeneous differential equation, as we have done in (3.2.11), and appending a “particular solution”, $\xi(J_k)$, which is dependent on the inhomogeneous term. The form of $\xi(J_k)$ is found by substituting it for the dependent variable in our original differential equation, i.e.

$$\begin{aligned}
\frac{d^{2}}{dt^{2}} \xi(J_k) + \sum_{i=1}^{3N} H_{ik} \xi(J_k) &= -J_k \\
\sum_{i=1}^{3N} H_{ik} \xi(J_k) &= -J_k \\
\xi(J_k) &= -\frac{J_k}{\sum_{i=1}^{3N} H_{ik}} ,
\end{aligned} \tag{3.3.8}$$

where the temporal derivative of $\xi(J_k)$ is zero since the Jacobian possesses no time dependence. Then, the general equation of motion for our forced harmonic oscillator is

$$q_k(t) = A_k \cos(\omega t + \phi) - \frac{J_k}{\sum_{i=1}^{3N} H_{ik}} . \tag{3.3.9}$$

We can validate this solution by dimensional analysis, where the particular solution has dimensions of force over force constant, i.e. distance, as would be expected.

Also, we note that at an energetic minimum, Jk = 0 and we recover the stationary point solution of (3.2.11).
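The forced-oscillator solution can be verified numerically for a single degree of freedom, in which case the sum over Hessian elements in (3.3.9) reduces to a single force constant $H$ with $\omega^{2} = H$. This reduction to one dimension, the parameter values and the function names below are assumptions made for the purpose of a minimal sketch.

```python
import math


def forced_oscillator(t, amplitude, omega, phase, jacobian, hessian):
    """q(t) = A cos(omega t + phase) - J / H for one degree of freedom."""
    return amplitude * math.cos(omega * t + phase) - jacobian / hessian


hessian = 4.0                 # force constant in mass-weighted coordinates
omega = math.sqrt(hessian)    # angular frequency of the homogeneous solution
jacobian = 0.6                # constant gradient at the expansion point
amplitude, phase = 0.3, 0.2


def residual(t, h=1e-4):
    """Finite-difference residual of q'' + H q + J, which should vanish."""
    def q(s):
        return forced_oscillator(s, amplitude, omega, phase, jacobian, hessian)

    second_derivative = (q(t + h) - 2.0 * q(t) + q(t - h)) / (h * h)
    return second_derivative + hessian * q(t) + jacobian


worst_residual = max(abs(residual(0.1 * n)) for n in range(50))

# Setting J = 0 recovers the stationary point solution (3.2.11).
stationary = forced_oscillator(0.7, amplitude, omega, phase, 0.0, hessian)
print(worst_residual, stationary)
```

The constant term simply shifts the centre of oscillation away from the expansion point, towards the minimum of the local quadratic well.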

3.4 Tyche - Conformational Sampling Software

In this section, we present a number of additional concepts and their computa- tional implementations. These will then come together to form tyche, a piece of software that we have developed over the past few years. tyche is the major conformational sampling methodology used within our group for the construction of kriging training sets.

3.4.1 Transformation to Normal Coordinates

By finding the six basis vectors which span $\mathbb{R}^{6}_{\mathrm{ext}}$, and exploiting the fact that the normal coordinates form a mutually orthogonal basis, the basis vectors spanning $\mathbb{R}^{3N-6}_{\mathrm{int}}$ can be found by a Gram-Schmidt orthogonalisation relative to the six external basis vectors.

The basis vectors of $\mathbb{R}^{6}_{\mathrm{ext}}$ are found by invoking the Sayvetz[254] (or Eckart[255]) conditions. These correspond to defining a reference frame for the system, whose origin is located at the molecular centre of mass, and whose axes are aligned with the principal axes of inertia of the system. As such, this reference frame translates and rotates with the system, such that for an observer in this reference frame, the molecule would appear stationary.

First, the molecule is translated to a frame of reference whose origin is at the molecular centre of mass. In terms of $\mathbf{D}$, the Sayvetz conditions are contained within the first six columns. We specify the $i$th column of $\mathbf{D}$ with $\mathbf{d}_i$. The first three columns correspond to the translational Sayvetz conditions,

$$\mathbf{d}_1 = \begin{bmatrix} \sqrt{m_1} & 0 & 0 & \sqrt{m_2} & 0 & 0 & \dots & \sqrt{m_N} & 0 & 0 \end{bmatrix}^{\top} , \tag{3.4.1}$$

$$\mathbf{d}_2 = \begin{bmatrix} 0 & \sqrt{m_1} & 0 & 0 & \sqrt{m_2} & 0 & \dots & 0 & \sqrt{m_N} & 0 \end{bmatrix}^{\top} , \tag{3.4.2}$$

$$\mathbf{d}_3 = \begin{bmatrix} 0 & 0 & \sqrt{m_1} & 0 & 0 & \sqrt{m_2} & \dots & 0 & 0 & \sqrt{m_N} \end{bmatrix}^{\top} . \tag{3.4.3}$$

These are clearly mutually orthogonal, but we also see that the scalar product with a generalised coordinate yields $\mathbf{d}_k \cdot \mathbf{q} = \sum_{i=1}^{3N} m_i x_i = 0$, $\forall k = 1, 2, 3$. Equality to zero follows from the frame of reference having its origin coincide with the molecular centre of mass.
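The translational conditions can be checked with a short sketch. The water-like geometry (in ångström), the atomic masses and the function names below are illustrative assumptions; any geometry shifted to its centre of mass would serve equally well.

```python
import math


def translational_vectors(atomic_masses):
    """Return d1, d2, d3 of (3.4.1)-(3.4.3) as lists of length 3N."""
    vectors = []
    for axis in range(3):
        d = []
        for m in atomic_masses:
            triplet = [0.0, 0.0, 0.0]
            triplet[axis] = math.sqrt(m)
            d.extend(triplet)
        vectors.append(d)
    return vectors


def dot(u, v):
    """Scalar product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def centre_of_mass(atomic_masses, positions):
    """Mass-weighted mean position."""
    total = sum(atomic_masses)
    return [
        sum(m * p[axis] for m, p in zip(atomic_masses, positions)) / total
        for axis in range(3)
    ]


atomic_masses = [15.999, 1.008, 1.008]
positions = [[0.0, 0.0, 0.117], [0.0, 0.757, -0.469], [0.0, -0.757, -0.469]]

com = centre_of_mass(atomic_masses, positions)
shifted = [[p[axis] - com[axis] for axis in range(3)] for p in positions]

# Mass-weighted coordinates in the centre of mass frame.
q = [math.sqrt(m) * c for m, p in zip(atomic_masses, shifted) for c in p]

d1, d2, d3 = translational_vectors(atomic_masses)
overlaps = [dot(d, q) for d in (d1, d2, d3)]
print(overlaps)
```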

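Defining $\mathbf{d}_4, \mathbf{d}_5, \mathbf{d}_6$ is significantly more involved. We introduce some novel notation for this purpose; an atomic position vector is defined as $\mathbf{x}_\alpha = \begin{bmatrix} x_\alpha & y_\alpha & z_\alpha \end{bmatrix}^{\top}$, where $x_\alpha, y_\alpha, z_\alpha$ are the components of the atomic position vector along the three Cartesian axes. We then define the moment of inertia tensor,

$$\mathbf{I} = \begin{bmatrix}
\sum_{\alpha=1}^{N} m_\alpha (y_\alpha^{2} + z_\alpha^{2}) & -\sum_{\alpha=1}^{N} m_\alpha x_\alpha y_\alpha & -\sum_{\alpha=1}^{N} m_\alpha x_\alpha z_\alpha \\
-\sum_{\alpha=1}^{N} m_\alpha y_\alpha x_\alpha & \sum_{\alpha=1}^{N} m_\alpha (x_\alpha^{2} + z_\alpha^{2}) & -\sum_{\alpha=1}^{N} m_\alpha y_\alpha z_\alpha \\
-\sum_{\alpha=1}^{N} m_\alpha z_\alpha x_\alpha & -\sum_{\alpha=1}^{N} m_\alpha z_\alpha y_\alpha & \sum_{\alpha=1}^{N} m_\alpha (x_\alpha^{2} + y_\alpha^{2})
\end{bmatrix} . \tag{3.4.4}$$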
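The construction of (3.4.4) is straightforward to sketch in code. The example below is a minimal illustration with a hypothetical pair of unit masses on the x axis; the function name is our own, and positions are assumed to be given relative to the centre of mass.

```python
def inertia_tensor(atomic_masses, positions):
    """Moment of inertia tensor of (3.4.4) as a 3 x 3 nested list."""
    tensor = [[0.0] * 3 for _ in range(3)]
    for m, (x, y, z) in zip(atomic_masses, positions):
        # Diagonal elements.
        tensor[0][0] += m * (y * y + z * z)
        tensor[1][1] += m * (x * x + z * z)
        tensor[2][2] += m * (x * x + y * y)
        # Upper off-diagonal elements (products of inertia).
        tensor[0][1] -= m * x * y
        tensor[0][2] -= m * x * z
        tensor[1][2] -= m * y * z
    # The tensor is symmetric, so mirror the upper triangle.
    tensor[1][0] = tensor[0][1]
    tensor[2][0] = tensor[0][2]
    tensor[2][1] = tensor[1][2]
    return tensor


# Two unit masses at x = +1 and x = -1: no inertia about the bond axis.
linear = inertia_tensor([1.0, 1.0], [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0)])

# A single off-axis mass gives non-zero products of inertia.
off_axis = inertia_tensor([2.0], [(1.0, 1.0, 0.0)])
print(linear)
print(off_axis)
```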

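The moment of inertia tensor can be diagonalised, $\mathbf{P}^{\top} \mathbf{I} \mathbf{P} = \mathbf{I}'$, where $\mathbf{P}$ contains the eigenvectors, or principal axes, of $\mathbf{I}$ and $\mathbf{I}'$ is a diagonal matrix containing the associated eigenvalues, the principal moments. Then, the elements of $\mathbf{d}_4, \mathbf{d}_5, \mathbf{d}_6$ are of the form

$$\begin{aligned}
\mathbf{d}_{4,\alpha} &= \left[ \left( \mathbf{x}_\alpha \cdot \mathbf{P}_2 \right)^{\top} \mathbf{P}_3 - \left( \mathbf{x}_\alpha \cdot \mathbf{P}_3 \right)^{\top} \mathbf{P}_2 \right] \sqrt{m_\alpha} \\
\mathbf{d}_{5,\alpha} &= \left[ \left( \mathbf{x}_\alpha \cdot \mathbf{P}_3 \right)^{\top} \mathbf{P}_1 - \left( \mathbf{x}_\alpha \cdot \mathbf{P}_1 \right)^{\top} \mathbf{P}_3 \right] \sqrt{m_\alpha} \\
\mathbf{d}_{6,\alpha} &= \left[ \left( \mathbf{x}_\alpha \cdot \mathbf{P}_1 \right)^{\top} \mathbf{P}_2 - \left( \mathbf{x}_\alpha \cdot \mathbf{P}_2 \right)^{\top} \mathbf{P}_1 \right] \sqrt{m_\alpha} ,
\end{aligned} \tag{3.4.5}$$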

where the numerical index of $\mathbf{P}$ denotes a column, and $\mathbf{d}_{k,\alpha}$ corresponds to the triplet of entries in $\mathbf{d}_k$ related to the three degrees of freedom of the $\alpha$th atom[256]. These rotational Sayvetz conditions correspond to cross products of the principal axes and atomic position vectors. However, for the sake of brevity, we simply refer the reader to the previously referenced works for a more in-depth derivation[254, 255].

We now have the six external basis vectors which span $\mathbb{R}^{6}_{\mathrm{ext}}$, and one can verify that they form a mutually orthogonal set. The $3N - 6$ basis vectors spanning $\mathbb{R}^{3N-6}_{\mathrm{int}}$ can then be found by a Gram-Schmidt orthogonalisation, which we now outline. We define the projection operator as a mapping from a vector, $\mathbf{v}$, onto a vector $\mathbf{u}$,

$$\mathrm{proj}_{\mathbf{u}}(\mathbf{v}) = \frac{\mathbf{v} \cdot \mathbf{u}}{\mathbf{u} \cdot \mathbf{u}} \mathbf{u} . \tag{3.4.6}$$

Denote the set of vectors defined by the columns of $\mathbf{D}$, $\{\mathbf{d}_1, \dots, \mathbf{d}_{3N}\}$. The first six such vectors, $\mathbf{d}_1, \dots, \mathbf{d}_6$, are those which span $\mathbb{R}^{6}_{\mathrm{ext}}$. We generate $\{\mathbf{d}_7, \dots, \mathbf{d}_{3N}\}$ randomly, and “project out” the components of these vectors that are not orthogonal to the six external basis vectors $\{\mathbf{d}_1, \dots, \mathbf{d}_6\}$. Invoking the projection operator, we iterate the following

$$\mathbf{d}_i = \mathbf{d}_i - \sum_{j=1}^{i-1} \mathrm{proj}_{\mathbf{d}_j}(\mathbf{d}_i) \qquad \forall i = 7, \dots, 3N , \tag{3.4.7}$$

until $\mathbf{d}_i \cdot \mathbf{d}_j = \delta_{ij}$, $\forall i, j = 1, \dots, 3N$, where $\delta_{ij}$ is the Kronecker delta. Note that we only Gram-Schmidt orthogonalise $\{\mathbf{d}_7, \dots, \mathbf{d}_{3N}\}$ since $\{\mathbf{d}_1, \dots, \mathbf{d}_6\}$ are already mutually orthogonal. Once all columns of $\mathbf{D}$ are mutually orthogonal, we normalise each so that the transformation described by $\mathbf{D}$ is norm-conserving.
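The procedure can be sketched as follows. Note that this is not the tyche implementation: for numerical robustness the sketch subtracts the projections one at a time (the “modified” Gram-Schmidt variant) and repeats the pass once, rather than forming the single sum of (3.4.7), and the two fixed vectors in a six-dimensional space merely stand in for the six external basis vectors. All names are our own.

```python
import math
import random


def dot(u, v):
    """Scalar product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project(v, u):
    """proj_u(v) of (3.4.6): the component of v along u."""
    scale = dot(v, u) / dot(u, u)
    return [scale * x for x in u]


def complete_basis(fixed, dimension, rng):
    """Extend mutually orthogonal `fixed` vectors to an orthonormal basis."""
    basis = [list(v) for v in fixed]
    while len(basis) < dimension:
        candidate = [rng.uniform(-1.0, 1.0) for _ in range(dimension)]
        for _ in range(2):  # a second pass mops up round-off error
            for u in basis:
                p = project(candidate, u)
                candidate = [c - pc for c, pc in zip(candidate, p)]
        # Reject the rare candidate that lies almost entirely in the span.
        if math.sqrt(dot(candidate, candidate)) > 1e-8:
            basis.append(candidate)
    # Normalise last, so that the transformation is norm-conserving.
    return [[x / math.sqrt(dot(v, v)) for x in v] for v in basis]


fixed = [[1.0, 1.0, 0.0, 0.0, 0.0, 0.0], [1.0, -1.0, 0.0, 0.0, 0.0, 0.0]]
basis = complete_basis(fixed, 6, random.Random(0))
max_deviation = max(
    abs(dot(basis[i], basis[j]) - (1.0 if i == j else 0.0))
    for i in range(6)
    for j in range(6)
)
print(max_deviation)
```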

We introduce the $3N \times 3N$ diagonal matrix, $\mathbf{M}$, whose elements are given by $M_{ii} = 1/\sqrt{m_i}$, the $3N \times 3N$ matrix $\mathbf{E}$ comprising the eigenvectors of the mass-weighted Hessian, and the $3N \times 3N$ diagonal matrix $\mathbf{\Omega}$, containing elements $\Omega_{ii} = \omega_i^{2}$. We subsequently subject the Hessian to the transformation

$$\mathbf{E}^{\top} \mathbf{D}^{\top} \mathbf{M} \mathbf{H} \mathbf{M} \mathbf{D} \mathbf{E} = \mathbf{\Omega} . \tag{3.4.8}$$

The first six diagonal entries of $\mathbf{\Omega}$, $\omega_1^{2}, \dots, \omega_6^{2}$, are equal to zero, and the associated eigenvectors correspond to the global translational and rotational degrees of freedom of the system. The remaining $3N - 6$ entries of $\mathbf{\Omega}$ correspond to the squared angular frequencies of the internal degrees of freedom of the system.

A useful quantity that we will make use of in the next section is the reduced mass of a normal mode, which we conveniently group together into the $3N$ vector $\boldsymbol{\mu}$, the individual elements of which are given by

$$\mu_k = \left[ (\mathbf{M}\mathbf{D}\mathbf{E})_k \cdot (\mathbf{M}\mathbf{D}\mathbf{E})_k \right]^{-1} , \tag{3.4.9}$$

where $(\mathbf{M}\mathbf{D}\mathbf{E})_k$ denotes the $k$th column of the matrix product $\mathbf{M}\mathbf{D}\mathbf{E}$. Note then by the equations of simple harmonic motion that the force constant of the $k$th harmonic oscillator is given by $\omega_k^{2} \mu_k$. Indeed, the quantity $\mathbf{M}\mathbf{D}\mathbf{E}$ will be of great importance when we come to discuss redundant internal coordinates in Section 3.5. $\mathbf{M}\mathbf{D}\mathbf{E}$ is a $3N \times 3N$ matrix, the columns of which correspond to individual normal modes, and the rows contain the Cartesian displacements associated with each normal mode.

3.4.2 Dynamics

We have previously given the equations of motion for each normal coordinate in (3.3.9), which is restated here in numerical form,

$$q_k(t + \Delta t_k) = q_k(t) + A_k \cos(\omega \Delta t_k + \phi) - \frac{J_k}{\sum_{i=1}^{3N} H_{ik}} . \tag{3.4.10}$$

There are three free parameters for each equation of motion: the amplitude, $A_k$, the integration timestep², $\Delta t_k$, and the phase, $\phi$. The amplitude can be discerned from the conventional equations of simple harmonic motion, where

$$A_k = \sqrt{\frac{2 E_k}{\omega_k^{2} \mu_k}} . \tag{3.4.11}$$

$E_k$ is the energy we have made available to the $k$th normal mode, $\omega_k$ its angular frequency and $\mu_k$ its reduced mass, the formulations for which we have given in Section 3.4.1.
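To give a feel for the magnitudes involved, the amplitude of (3.4.11) can be evaluated for a single hypothetical mode. The wavenumber, reduced mass and temperature below are illustrative assumptions rather than values from this work, and the function name is our own; SI units are used throughout.

```python
import math

BOLTZMANN = 1.380649e-23           # J K^-1
ATOMIC_MASS_UNIT = 1.66053907e-27  # kg
SPEED_OF_LIGHT_CM = 2.99792458e10  # cm s^-1


def amplitude(energy, omega, reduced_mass):
    """A_k = sqrt(2 E_k / (omega_k^2 mu_k)), as in (3.4.11)."""
    return math.sqrt(2.0 * energy / (omega * omega * reduced_mass))


# Hypothetical mode: 1600 cm^-1, reduced mass of 1 u, energy k_B T / 2 at 298 K.
wavenumber = 1600.0
omega = 2.0 * math.pi * SPEED_OF_LIGHT_CM * wavenumber  # rad s^-1
reduced_mass = 1.0 * ATOMIC_MASS_UNIT
energy = 0.5 * BOLTZMANN * 298.0

a_k = amplitude(energy, omega, reduced_mass)

# At the turning point all of E_k is potential: (1/2) * (omega^2 mu) * A^2.
turning_point_energy = 0.5 * omega * omega * reduced_mass * a_k * a_k
print(f"amplitude = {a_k:.3e} m")
```

The amplitude scales with the square root of the allocated energy, so the energy allocation schemes discussed next directly control how far along each mode the sampling can reach.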

Energy Allocation

The total thermal energy that is available to a single degree of freedom is given by a standard equipartition of energy, kBT/2. An immediate solution is to simply set each Ek = kBT/2. However, we find this choice to be unnecessarily restrictive.

The equipartition value kBT/2 is the expectation of the energy in each degree of freedom. In reality, the energy available to each degree of freedom is a dynamic quantity which fluctuates about the equipartition value as energy is redistributed through the degrees of freedom.

We are able to sample the microcanonical ensemble by working with the total thermal energy available to a system, $(3N - 6) k_B T / 2$. By stochastically distributing this energy through all degrees of freedom, we have

$$E_k = (3N - 6) \frac{k_B T}{2} \, \mathcal{U}(0, 1) , \tag{3.4.12}$$

where $\mathcal{U}(0, 1)$ is a random number selected from the uniform distribution on $[0, 1]$. One can subsequently rescale all energies so that $\sum_k E_k = (3N - 6) k_B T / 2$. Therefore the total energy within the system remains constant throughout the sampling.

² While an integration timestep is typically the same across all degrees of freedom, we have suggestively allowed the integration timestep to vary across the degrees of freedom.

Such a choice grants a great deal of conformational freedom for the subsequent sampling. For example, there is the possibility that a single degree of freedom possesses the vast majority of the total thermal energy. The resultant equations of motion consequently permit the exploration of the high energy portions of the potential well along a single degree of freedom.
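The allocation of (3.4.12), followed by the rescaling step, can be sketched as below. This is an illustration rather than the tyche source: the function name, the use of molar units (kJ mol$^{-1}$) and the ten-atom example are our own assumptions.

```python
import random

GAS_CONSTANT = 0.008314462618  # k_B per mole, in kJ mol^-1 K^-1


def allocate_uniform(n_atoms, temperature, rng):
    """Draw E_k as in (3.4.12), then rescale so that the energies
    sum to the total thermal energy (3N - 6) k_B T / 2."""
    n_modes = 3 * n_atoms - 6
    total = n_modes * GAS_CONSTANT * temperature / 2.0
    raw = [total * rng.random() for _ in range(n_modes)]
    scale = total / sum(raw)
    return [scale * e for e in raw]


rng = random.Random(42)
n_atoms, temperature = 10, 298.0
energies = allocate_uniform(n_atoms, temperature, rng)
total_energy = (3 * n_atoms - 6) * GAS_CONSTANT * temperature / 2.0

print(f"modes: {len(energies)}, total: {sum(energies):.3f} kJ/mol")
print(f"largest single-mode share: {max(energies) / total_energy:.1%}")
```

Because only the sum is constrained, an individual mode may receive many times the equipartition value, which is the behaviour described in the paragraph above.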

However, one also needs to safeguard against physically unrealistic conformations, where chemical bonds have actually been broken. Such conformations are of no use over the course of an MD simulation, and are more frequently encountered when exploring the high energy regions of conformational space. The aforementioned safeguarding can be undertaken by requiring all generated conformers to be constrained within some conformational domain. For instance, one can constrain a subset of valence coordinates (e.g. all bond lengths and valence angles) to lie within some range of their equilibrium values, e.g. each bond length, $b$, lies within the range $c^{-1} b_0 \leq b \leq c b_0$, where $c$ is a suitably chosen “stretching” parameter and $b_0$ is the equilibrium bond length. Indeed, GUIs which display chemical structures use such a criterion in ascertaining whether a bond should be drawn between two atoms; there, $b_0$ is the sum of the van der Waals radii of two atoms, and $c$ is conventionally set to be 1.2.
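Such a filter amounts to a single comparison per bond, as the following sketch shows. The function name and the example bond lengths are hypothetical, and borrowing the value 1.2 as the default stretching parameter is an assumption made only for illustration.

```python
def bonds_within_limits(bond_lengths, equilibrium_lengths, stretch=1.2):
    """True if every bond satisfies b0 / c <= b <= c * b0."""
    return all(
        b0 / stretch <= b <= stretch * b0
        for b, b0 in zip(bond_lengths, equilibrium_lengths)
    )


equilibrium = [0.96, 1.09, 1.53]   # illustrative equilibrium lengths
accepted = [1.02, 1.00, 1.60]      # mildly distorted conformer
rejected = [0.96, 1.09, 2.10]      # third bond stretched beyond c * b0

print(bonds_within_limits(accepted, equilibrium))
print(bonds_within_limits(rejected, equilibrium))
```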

The distribution of molecular energies as obtained through allocating the energy by (3.4.12) is depicted in Figure 3.1. The sampling well is deep; the range of molecular energies is roughly 120 kJ mol$^{-1}$, and spans the physically relevant regions of conformational space. However, we note that there is a significant discontinuity between the ab initio minimum energy and the tyche sampling energy. This discontinuity is not ideal, but could be reduced by an adequate parameterisation of the stretching parameter discussed above.

Figure 3.1: Distribution of molecular energies obtained through tyche by use of the energy allocation scheme of (3.4.12). The lowest energy point corresponds to the ab initio minimum. [Plot of molecular energy (Hartree), spanning approximately -247.03 to -246.96, against conformer.]

We have, however, decided against this methodology. The choice of the stretching parameter is entirely arbitrary, and renders our conformational sampling more akin to a systematic sampling within some hypercube domain on the valence coordinates. To circumvent the necessity of parameterising the conformational domain upon which we sample, while maximising the amount of conformational space available for sampling, we can instead sample within the canonical ensemble by choosing

$$E_k = \frac{k_B T}{2} \, \mathcal{N}(1, 1) , \tag{3.4.13}$$

where $\mathcal{N}(1, 1)$ is a random number selected from a Gaussian distribution with expectation value and standard deviation of unity. No rescaling to enforce $\sum_k E_k = (3N - 6) k_B T / 2$ is performed with this approach, enhancing the amount of conformational space available to the system. $\mathcal{N}(1, 1)$ limits the accessibility to the physically unrealistic high energy portions of conformational space.

The distribution of molecular energies as obtained through allocating the energy by (3.4.13) is depicted in Figure 3.2. We see that the range of energies, roughly 50 kJ mol⁻¹, has been significantly reduced relative to Figure 3.1, restricting our sampling to the lower energy regions of conformational space. We have, however, reduced the discontinuity between the ab initio minimum energy and the tyche sampling energy by a great deal.

Figure 3.2: Distribution of molecular energies obtained through tyche by use of the energy allocation scheme of (3.4.13). The lowest energy point corresponds to the ab initio minimum.

[Plot: molecular energy (Hartree) against conformer.]

This discontinuity remains undesirable, even in Figure 3.2. Ideally, we would like the sampling to be continuous across the molecular energies. To this end, we alter (3.4.13) to read

\[ E_k = \frac{k_B T}{2}\, U(0, 1)\, \mathcal{N}(1, 1) . \tag{3.4.14} \]

We have added a uniform random variate to modify the temperature, allowing it to take on values lower than that specified by $T$. In this way, we essentially force sampling of the lower energy regions of the PES ($U(0, 1) \to 0$), while retaining the ability to sample the higher energy regions of the PES ($U(0, 1) \to 1$).

Figure 3.3: Distribution of molecular energies obtained through tyche by use of the energy allocation scheme of (3.4.14). The lowest energy point corresponds to the ab initio minimum.

[Plot: molecular energy (Hartree) against conformer.]

From Figure 3.3, we see that, by adopting the energy allocation scheme of (3.4.14), we are able to essentially remove the discontinuity in energy between the seeding geometry and the sampling conformations, while still retaining the ability to sample the high energy regions of the well; the range of sampling energies is now roughly 85 kJ mol⁻¹. Unfortunately, this energy allocation scheme is not equivalent to a particular thermodynamic ensemble, but we find the sampling more in line with our requirements for the construction of a kriging model training set.
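The two allocation schemes, (3.4.13) and (3.4.14), can be sketched as follows. This is an illustration under stated assumptions rather than the tyche implementation: the function name, the `scaled` switch and the choice of Hartree units are hypothetical, and the text does not state how negative draws from $\mathcal{N}(1, 1)$ are treated, so the sketch returns them unchanged.

```python
import numpy as np

K_B = 3.166811563e-6  # Boltzmann constant in Hartree per kelvin

def allocate_energies(n_modes, temperature, rng, scaled=False):
    """Draw one energy per normal mode (hypothetical helper).

    scaled=False follows (3.4.13): E_k = (k_B T / 2) N(1, 1).
    scaled=True  follows (3.4.14): E_k = (k_B T / 2) U(0, 1) N(1, 1).
    """
    # Gaussian variate with mean and standard deviation of unity.
    gauss = rng.normal(loc=1.0, scale=1.0, size=n_modes)
    energies = 0.5 * K_B * temperature * gauss
    if scaled:
        # Uniform variate that lowers the effective temperature per mode.
        energies = energies * rng.uniform(0.0, 1.0, size=n_modes)
    # Negative draws are possible; their handling is not specified in the
    # text, so they are left untouched here.
    return energies
```

Under (3.4.14) the mean energy per mode is half of that under (3.4.13), since the uniform variate has an expectation value of one half.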

Integration Timestep

Our choice for the integration timestep is helped by ascertaining the associated time period of each degree of freedom, $T_k = 2\pi/\omega_k$. For each period of the motion, it seems reasonable to require that a number of points, $n_{period}$, be sampled along the trajectory. Therefore, the integration timestep is given by $\Delta t_k = T_k/n_{period}$. Our reason for individually parameterising the integration timestep for each degree of freedom is the range of frequencies spanned by the normal modes (typically greater than an order of magnitude). If we were to set a global integration timestep, governed by the highest frequency degree of freedom, then the displacement $\Delta q_k = q_k(t + \Delta t) - q_k(t)$ for a low frequency degree of freedom would be extremely small. By allowing each degree of freedom to have an associated integration timestep governed by its own dynamics, we can ensure an expansive conformational sampling. The reader is directed towards the discussion of RESPA in Section 3.1, where parallels can be drawn. We note that a single user-defined parameter, $n_{period}$, is required to this end.
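The per-mode timestep can be written in a few lines. The function name is hypothetical and consistent units for the angular frequencies are assumed.

```python
import numpy as np

def mode_timesteps(omega, n_period):
    """Per-mode timestep dt_k = T_k / n_period, with T_k = 2 * pi / omega_k."""
    omega = np.asarray(omega, dtype=float)
    if np.any(omega <= 0.0):
        raise ValueError("angular frequencies must be positive")
    periods = 2.0 * np.pi / omega  # time period of each normal mode
    return periods / n_period
```

Two modes whose frequencies differ by a factor of ten receive timesteps that differ by the same factor, so each mode advances by the same fraction of its own period per step.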

Phase

The final free variable in (3.4.10) is the phase of the equation of motion, φ. The meaning associated with this value, as far as we can ascertain, is largely irrelevant. There appears to be no physically intuitive manner to define these phases. The only requirement is that 0 ≤ φ ≤ 2π, whose implementation is not strictly necessary since our equations of motion are periodic in φ. As such, each φ is randomly selected, and is redefined after the output of each conformer.

Sample Output

All free variables have now been given and we are in a position to undertake a conformational sampling. In practice, we do not output each conformer generated for every integration timestep. Conformers from successive integration timesteps are highly correlated with one another, and are typically similar. Ideally, we would like to remove all correlation between outputted conformers. To do so, we evolve the system with an initial parameterisation of the {Ek, φ}. After a complete time period has been completed, the {Ek, φ} are randomly redefined and the process is iterated. The output of a conformation is governed by, for each time period, selecting a random integer between [1, nperiod], and outputting the conformer which has been generated after that number of timesteps. In this way, we avoid the necessity of adding another parameter that governs the frequency with which conformers are output. Since conformers from systems governed by differing {Ek, φ} are uncorrelated (if the random numbers used to define these values are uncorrelated), the sampled conformers are also uncorrelated.

3.4.3 Markov Chains on Tessellated PESs

Our formulation so far permits a local exploration of the ab initio PES (to second order) about a seeding geometry. However, to construct training sets that are of use for MD, we require a more thorough exploration of conformational space. One way to achieve a global exploration of the PES is to use a number of local explorations. In this way, one reconstructs the PES by tessellating with LWs. In the limit of infinite such LWs, one obtains the full ab initio PES.

Parallels between the tessellation of the PES by a number of LWs and the construction of a PES from interpolating between points of the PES[15, 257] can be drawn. However, while the interpolation schemes use just the molecular energy (and sometimes first derivatives) for the fit, we use the higher order topological information contained within the Jacobian and Hessian, which is arguably a superior use of ab initio data. In addition, the interpolation schemes are typically only viable for small molecular PESs[258, 259], whereas our method is not particularly bounded by system size.

An issue with the tessellation approach is that a series of LWs cannot be trivially linked into a continuous PES upon which one can conduct classical dynamics. Since each LW is approximated as quadratic in the displacement from the seeding geometry, the sides of the LW cannot be joined in a continuous fashion without interpolating between them in some way. This is perhaps an avenue that could be explored in the future, but is avoided in our current formulation.

Instead, we have decided to forsake the benefits associated with continuous dynamics for a stochastic approach. Indeed, it could be argued that continuous dynamics are of no use for the purposes of conformational sampling. To this end, we have proposed the use of dynamics within a LW, followed by “hopping” to another LW, and iterating this process. The trajectory is then continuous within a well, and discontinuous with respect to hopping events. Naturally, the ideal candidate for the hopping scheme is the Metropolis method, leading to the hopping events taking the form of a Markov chain.

A Markov chain is defined as a stochastic process where the probability of an event (the probability of a state $x_n$ given the past states of the system $x_{n-1}, \ldots, x_1$, written $P(x_n | x_{n-1}, \ldots, x_1)$) is equal to its probability conditional only on the previous event[260], i.e.

\[ P(x_n | x_{n-1}, \ldots, x_1) = P(x_n | x_{n-1}) . \tag{3.4.15} \]

In other words, the history of the system is irrelevant in discerning the next state of the system. One can subsequently form the probability of a sequence of states as

\[ P(x_n, \ldots, x_1) = P(x_n | x_{n-1}) P(x_{n-1} | x_{n-2}) \cdots P(x_2 | x_1) . \tag{3.4.16} \]

Consider a stochastic matrix of such conditional probabilities, $\Gamma_{ij} = P(x_j | x_i)$. We require $\Gamma$ to be row-normalised, i.e. $\sum_j \Gamma_{ij} = 1$, which satisfies the requirement that the system can transition from $x_i$ to another state within the sample space[261]. Naturally, each element satisfies $0 \le \Gamma_{ij} \le 1$. There exists a unique left-eigenvector, $\gamma$, of the stochastic matrix, such that

\[ \gamma^\top \Gamma = \gamma^\top , \tag{3.4.17} \]

where $\gamma$ is a Perron-Frobenius eigenvector, with associated eigenvalue of one. All other eigenvectors of the stochastic matrix possess eigenvalues smaller than that of the Perron-Frobenius eigenvector. $\gamma$ then corresponds to an equilibrium state of the system, to which a Markov chain will converge, given that the system is ergodic[262].
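The eigenvector condition (3.4.17) can be checked numerically on a small example. The three-state matrix below is invented purely for illustration, and the function name is hypothetical; the sketch obtains the left eigenvector with eigenvalue one from the eigen-decomposition of the transposed matrix.

```python
import numpy as np

def stationary_distribution(stochastic_matrix):
    """Left eigenvector of a row-stochastic matrix with eigenvalue one."""
    G = np.asarray(stochastic_matrix, dtype=float)
    if not np.allclose(G.sum(axis=1), 1.0):
        raise ValueError("matrix must be row-normalised")
    # Left eigenvectors of G are right eigenvectors of its transpose.
    eigenvalues, eigenvectors = np.linalg.eig(G.T)
    index = np.argmin(np.abs(eigenvalues - 1.0))
    vec = np.real(eigenvectors[:, index])
    return vec / vec.sum()  # normalise to a probability distribution

# Illustrative (made-up) three-state transition matrix.
G = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.7, 0.1],
              [0.1, 0.3, 0.6]])
gamma = stationary_distribution(G)
```

Repeated application of the matrix to any starting distribution approaches `gamma` for this example, which is the convergence property referred to in the text.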

We require that any closed trajectory of the form $P(x_1, x_n, \ldots, x_1)$ satisfy the Kolmogorov criterion and form a reversible Markov chain, whereby

\[ P(x_1 | x_n) P(x_n | x_{n-1}) \cdots P(x_2 | x_1) = P(x_1 | x_2) \cdots P(x_{n-1} | x_n) P(x_n | x_1) . \]

This condition can be enforced by the detailed balance condition (see footnote 3),

\[ \Gamma_{ij} P(x_i) = \Gamma_{ji} P(x_j) . \tag{3.4.18} \]

Expanding out the elements of the stochastic matrix and rearranging, we obtain

\[ \frac{P(x_j | x_i)}{P(x_i | x_j)} = \frac{P(x_j)}{P(x_i)} . \tag{3.4.19} \]

Footnote 3: In fact, the detailed balance condition is excessively strict for our purposes; one can construct Markov processes with Perron-Frobenius eigenvectors that do not satisfy detailed balance. However, detailed balance is a condition that is readily implementable.

We write the elements of the stochastic matrix as a product of two separate terms, $P(x_j | x_i) = g(x_j | x_i) A(x_j | x_i)$. The proposal distribution, $g(x_j | x_i)$, and the acceptance distribution, $A(x_j | x_i)$, correspond to the probability of proposing and accepting the state $x_j$ from the state $x_i$, respectively. Rewriting (3.4.19),

\[ \frac{A(x_j | x_i)}{A(x_i | x_j)} = \frac{P(x_j)\, g(x_i | x_j)}{P(x_i)\, g(x_j | x_i)} . \tag{3.4.20} \]

Our final step requires that we choose an acceptance probability that satisfies the above condition. We require that the acceptance probability be bounded from above by unity, and so introduce the Metropolis choice,

\[ A(x_j | x_i) = \min\left[ 1, \frac{P(x_j)\, g(x_i | x_j)}{P(x_i)\, g(x_j | x_i)} \right] . \tag{3.4.21} \]

The Metropolis algorithm is presented in Algorithm 1. For our purposes, we have no condition to place on the proposal distribution, so for simplicity we set it to be the uniform distribution. The acceptance distribution then takes the form of the standard Boltzmann weighting scheme, $A(x_j | x_i) = \min\{1, \exp[-\Delta E(x_j | x_i)/k_B T_{met}]\}$, where $\Delta E(x_j | x_i) = E(x_j) - E(x_i)$ is the energy difference between the seeding geometries. We have also introduced $T_{met}$, the Metropolis temperature, which is distinct from the temperature with which we perform the dynamics. This distinction allows for a more flexible hopping scheme, where the user can artificially increase or decrease the transition probability if more or less (respectively) sampling is required.
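As a brief check that the Metropolis choice (3.4.21) satisfies the ratio condition (3.4.20), it is convenient to write $R_{ji}$ for the right-hand side of (3.4.20). The shorthand $R_{ji}$ is introduced here for illustration only and does not appear elsewhere in the text. Since $R_{ij} = 1/R_{ji}$, the two possible cases give

\[ R_{ji} \le 1: \quad A(x_j | x_i) = R_{ji}, \quad A(x_i | x_j) = \min[1, 1/R_{ji}] = 1 \quad \Rightarrow \quad \frac{A(x_j | x_i)}{A(x_i | x_j)} = R_{ji} , \]

\[ R_{ji} > 1: \quad A(x_j | x_i) = 1, \quad A(x_i | x_j) = 1/R_{ji} \quad \Rightarrow \quad \frac{A(x_j | x_i)}{A(x_i | x_j)} = R_{ji} . \]

In both cases the ratio of acceptance probabilities reproduces (3.4.20), while neither acceptance probability exceeds unity.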

3.4.4 Algorithm

Given nseed seeding geometries, an ab initio calculation is performed for each, and the energy, Jacobian and Hessian are found for each. Our aim is then the generation of a conformational sampling set, S, comprising nsample distinct molecular conformations.

Select a random starting state, x1
xi = x1 ; xj = 0
while nmarkov < n do
    while xj = 0 do
        Propose a new state, x', in accordance with g(x'|xi)
        Accept a transition to x' with probability A(x'|xi) given by (3.4.21)
        If accepted, transition to x' and add to Markov chain, xj = x'
    end
    xi = xj ; xj = 0
    nmarkov = nmarkov + 1
end

Algorithm 1: The Metropolis Algorithm
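A minimal Python sketch of the hopping scheme of Algorithm 1 is given below. It is an illustration under stated assumptions, not the tyche source: the function names are hypothetical, energies are taken to be in Hartree, the proposal is uniform over all seeds other than the current one, and the acceptance probability is the Boltzmann form that follows from (3.4.21) with a uniform proposal.

```python
import math
import random

K_B = 3.166811563e-6  # Boltzmann constant in Hartree per kelvin

def acceptance(e_current, e_proposed, t_met):
    """Metropolis acceptance probability for a hop between two seeds."""
    delta = e_proposed - e_current
    if delta <= 0.0:
        return 1.0  # downhill hops are always accepted
    return math.exp(-delta / (K_B * t_met))

def metropolis_hops(seed_energies, n_hops, t_met, rng):
    """Return the visited seed indices (length n_hops + 1)."""
    n_seed = len(seed_energies)
    if n_seed < 2:
        raise ValueError("at least two seeding geometries are required")
    current = rng.randrange(n_seed)  # random starting state
    chain = [current]
    while len(chain) <= n_hops:
        proposed = rng.randrange(n_seed)  # uniform proposal
        if proposed == current:
            continue  # assumption: a hop must change the seed
        p = acceptance(seed_energies[current], seed_energies[proposed], t_met)
        if rng.random() < p:
            # As in Algorithm 1, only accepted states are added to the chain.
            current = proposed
            chain.append(current)
    return chain
```

In a full sampling run, the local dynamics would be performed within the LW of each seed as it is added to the chain; raising the Metropolis temperature makes uphill hops more likely.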

For each seeding geometry, the transformation matrix D is computed and used to transform the Jacobian and Hessian into a normal coordinate basis. The Hessian is diagonalised, and the Cartesian state vector x, angular frequencies ω, reduced masses µ, inhomogeneous terms ξ and matrix MDE are stored.

The sampling itself is conducted by the formation of a Markov chain on the seeding geometries. At each step of the Markov chain, the seeding geometry undergoes vibrational analysis subject to the methodology outlined in Section 3.4.2. Samples are added to S until it contains nsample conformations, at which point tyche terminates.

In Figure 3.4, we give a simple one-dimensional example of the tyche methodology. Seven seeding geometries are used, and labelled A-G. About each of these seeds, we approximate the PES as a LW, upon which dynamics can be conducted. It is worth noting that point F constitutes a non-stationary point seed; the dynamics about this point are governed by the motion of a driven harmonic oscillator, driving the minimum of the LW towards the local minimum on the PES.

Of course, we appreciate that the tessellation of LWs becomes a poorer approximation to the ab initio PES in higher dimensions, and as the density of seeding geometries decreases. In reality, the more seeding geometries included, the better the representation of the PES and the more realistic the dynamics.

Figure 3.4: One-dimensional analytical PES (blue) and the tyche-generated PES (red), recreated through the seeding geometries A-G. Stochastic dynamics are performed upon this tessellation of LWs in an attempt to generate a sample set that would be representative of continuous dynamics on the ab initio PES.

For the sake of argument, we commence our sampling within the LW of C. Dynamics are performed within this well and a sample is output. Another seeding point, B, is then selected, and a transition to this LW is attempted. Notice that the transition from C to B is unfavourable since B is of a higher energy than C. In this instance, the transition to B is rejected, and another seeding geometry, G, is selected. The transition from C to G is favourable since G is lower in energy than C, and so the transition occurs, and dynamics are conducted within the LW of G. This process is iterated until the requested number of samples has been output.

In Table 3.1, we present the memory requirements of running the tyche algorithm.

Term     Number of Elements     Memory Required (floats)
x        3N                     nseed × 3N
MDE      3N × 3N                nseed × 3N × 3N
ω        3N                     nseed × 3N
µ        3N                     nseed × 3N
ξ        3N                     nseed × 3N
Total                           3nseedN(4 + 3N)

Table 3.1: Memory requirements for tyche.

Assume we have a system comprising 100 atoms. Given a modest 2 GB of memory, we are then able to reconstruct the PES with almost 22000 seeding geometries, which is certainly a significant number. Of course, each one of these geometries would need to undergo an ab initio calculation in order that the Jacobian and Hessian could be computed for each, which would be prohibitively expensive for 22000 conformers comprising 100 atoms. However, the tyche methodology does not appear to be limited by memory requirements. The algorithm also runs extremely quickly, and so computational time is not prohibitive either.
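The arithmetic behind this estimate can be reproduced from the total in Table 3.1. The text does not state the storage size assumed per element, so the sketch below treats it as a free parameter; the function name and the reading of 2 GB as 2 × 10⁹ bytes are assumptions made for illustration.

```python
def max_seeds(n_atoms, memory_bytes, bytes_per_element):
    """Largest nseed whose stored terms fit in memory (Table 3.1 total)."""
    # Elements stored per seeding geometry: 3N(4 + 3N).
    elements_per_seed = 3 * n_atoms * (4 + 3 * n_atoms)
    return memory_bytes // (elements_per_seed * bytes_per_element)

# Seeds that fit in 2e9 bytes for 100 atoms at several element sizes.
counts = {size: max_seeds(100, 2_000_000_000, size) for size in (1, 4, 8)}
```

For 100 atoms each seed stores 91200 elements. Dividing 2 × 10⁹ by this number reproduces the quoted figure of almost 22000 (21929); if each element occupies 4 or 8 bytes, the count falls to 5482 or 2741 respectively, which remains a large number of seeds.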

3.5 Redundant Internal Coordinates

3.5.1 Overview

The redundant internal coordinates (RICs) of a molecular system constitute a basis within which the molecular geometry can be fully specified, but the number of basis functions exceeds the 3N−6 internal degrees of freedom of the system. Historically, RICs have been a useful choice for geometry optimisations of molecular systems. It has been found that expressing the Hessian in a RIC basis minimises the couplings between the degrees of freedom, i.e. the off-diagonal elements of the Hessian are minimised. Herein, we will use N to represent the number of non-redundant degrees of freedom and M the number of RICs, where M = N + R, R being the number of redundancies in the RIC system in question[263].

The choice of coordinates with which a molecular conformation is expressed can have profound consequences on a number of computational methodologies. For example, it can be shown that a rectilinear basis, e.g. Cartesians, is not suitable for expressing large displacements involving curvilinear degrees of freedom, such as torsional motions[264]. Attempting to express a displacement in a rectilinear basis can lead to artificial lengthening of a number of other curvilinear degrees of freedom, such as bond stretches[265].

Transforming from an N-dimensional rectilinear coordinate system, x, to an M-dimensional curvilinear coordinate system, q, requires a Taylor expansion of the curvilinear coordinates in terms of the rectilinear coordinates. For the ith curvilinear coordinate, $q_i(x_1, \ldots, x_N)$,

\[ q_i - q_i^0 = A_i + \sum_{j=1}^{N} \frac{\partial q_i}{\partial x_j} \left( x_j - x_j^0 \right) + \frac{1}{2} \sum_{j,k=1}^{N} \frac{\partial^2 q_i}{\partial x_j \partial x_k} \left( x_j - x_j^0 \right) \left( x_k - x_k^0 \right) + \ldots \tag{3.5.1} \]

where the derivatives are evaluated at $q = q^0$. Introducing the difference coordinates, $\Delta x = x - x^0$ and $\Delta q = q - q^0$, and the partial derivatives

\[ B_{ij} = \left. \frac{\partial q_i}{\partial x_j} \right|_{q = q^0} , \qquad C_{ijk} = \left. \frac{\partial^2 q_i}{\partial x_j \partial x_k} \right|_{q = q^0} , \]

we can drop the arbitrary shift $A_i$ and rewrite the above as

\[ \Delta q_i = \sum_{j=1}^{N} B_{ij} \Delta x_j + \frac{1}{2} \sum_{j,k=1}^{N} C_{ijk} \Delta x_j \Delta x_k + \ldots , \qquad \Delta q = B \Delta x + \frac{1}{2} \Delta x^\top C \Delta x + \ldots \tag{3.5.2} \]

If we consider only infinitesimal displacements in the difference coordinates, then displacements in q will become first order with respect to displacements in x. Thus, we are able to neglect all second order and higher terms in the above expansion, and subsequently

\[ \delta q_i \approx \sum_{j=1}^{N} B_{ij} \delta x_j , \qquad \delta q \approx B \cdot \delta x . \tag{3.5.3} \]

For infinitesimal displacements, the transformation from a Cartesian to a redundant internal coordinate basis is given by the “Wilson B-matrix”, $B : \mathbb{R}^N \to \mathbb{R}^M$, whose elements are the derivatives of the redundant internal coordinates with respect to the Cartesian degrees of freedom.
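The linearisation (3.5.3) can be checked numerically with a finite-difference estimate of the B-matrix. The sketch below is illustrative only: the helper names, the water-like geometry and the choice of two bond lengths plus one angle (which is not a redundant set) are assumptions made to keep the example small.

```python
import numpy as np

def numerical_b_matrix(internal_coords, x0, step=1.0e-6):
    """Central-difference estimate of B_ij = dq_i/dx_j at x0."""
    x0 = np.asarray(x0, dtype=float)
    q0 = np.asarray(internal_coords(x0), dtype=float)
    B = np.zeros((q0.size, x0.size))
    for j in range(x0.size):
        dx = np.zeros_like(x0)
        dx[j] = step
        B[:, j] = (internal_coords(x0 + dx) - internal_coords(x0 - dx)) / (2.0 * step)
    return B

def water_like_coords(x):
    """Two bond lengths and one angle for atoms ordered O, H, H."""
    o, h1, h2 = np.asarray(x, dtype=float).reshape(3, 3)
    u, v = h1 - o, h2 - o
    ru, rv = np.linalg.norm(u), np.linalg.norm(v)
    angle = np.arccos(np.dot(u, v) / (ru * rv))
    return np.array([ru, rv, angle])

# Illustrative geometry (arbitrary length units).
x0 = np.array([0.0, 0.0, 0.0, 0.96, 0.0, 0.0, -0.24, 0.93, 0.0])
B = numerical_b_matrix(water_like_coords, x0)

# Compare the exact change in q with the first-order estimate B . dx.
dx = 1.0e-4 * np.random.default_rng(0).standard_normal(x0.size)
dq_exact = water_like_coords(x0 + dx) - water_like_coords(x0)
dq_linear = B @ dx
```

For a small displacement the two estimates of the change in the internal coordinates agree to second order in the displacement, and a rigid translation of all atoms yields no change in any internal coordinate.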

Writing the potential energy of a molecular system as a Taylor series in the RICs,

\[ V(\Delta q) = V_0 + g_q^\top \Delta q + \frac{1}{2} \Delta q^\top H_q \Delta q , \tag{3.5.4} \]

where $g_q$ and $H_q$ are the gradient and Hessian of the potential energy in the RIC basis, respectively. Substituting the power series in $\Delta x$ for $\Delta q$, and equating equal powers in the displacement with a Taylor series of the potential energy in the Cartesian coordinates, we obtain the following transformations[266, 263]

\[ g_x = B^\top g_q , \qquad H_x = B^\top H_q B + g_q^\top C . \tag{3.5.5} \]

It is worth pointing out that since the Hessian transforms linearly in the gradi- ent, it is, by definition, a covariant tensor. Then, Hq requires modification with Christoffel symbols for it to take the form of a tensor comprising second derivatives with respect to the RICs[267].

3.5.2 Valence Coordinates

The most intuitive set of RICs are perhaps those defined by the connectivity of the molecule. The valence coordinates are such a set, comprising bond stretch, valence angle and dihedral torsional degrees of freedom, and are depicted in Figure 3.5. We describe each of these in turn and their respective derivatives with respect to the Cartesian degrees of freedom.

In evaluating the derivatives of the valence coordinates, one can adopt either of two approaches. The first is purely analytical, with a great deal of cumbersome algebraic manipulation[268], whereas the second uses the fact that the gradient vector of a scalar field is directed towards the maximum increase in the scalar field, which can be ascertained by inspection[269]. We adopt the latter approach since we deal only with simple valence coordinates.

Bond Stretching

Given two atomic position vectors, $x_\alpha$ and $x_\beta$, we define the bond vector between the two as $x_{\alpha\beta} = x_\beta - x_\alpha$ (i.e. the vector directed from α to β), and the bond length $r_{\alpha\beta} = \lVert x_{\alpha\beta} \rVert$. For this case, the RIC is the bond length, i.e. $q_{bond}^{\alpha\beta} = r_{\alpha\beta}$. The derivatives of $q_{bond}^{\alpha\beta}$ with respect to the Cartesian degrees of freedom of atoms γ ≠ α, β are obviously zero, since the displacement of these atoms does not affect $q_{bond}^{\alpha\beta}$.

The derivatives of $q_{bond}^{\alpha\beta}$ with respect to the components of $x_\alpha$ are given by the components of the unit vector that maximises $q_{bond}^{\alpha\beta}$ when displacing atom α along the unit vector. By inspection, we verify that the unit vector is simply $-x_{\alpha\beta}/r_{\alpha\beta}$.

Similarly, the derivatives of $q_{bond}^{\alpha\beta}$ with respect to the elements of $x_\beta$ are given by the unit vector that maximises $r_{\alpha\beta}$, which is $x_{\alpha\beta}/r_{\alpha\beta}$. Therefore, the derivatives of the bond stretching RICs are given by

\[ \frac{\partial q_{bond}^{\alpha\beta}}{\partial x_i} = \begin{cases} -(x_{\alpha\beta})_i / r_{\alpha\beta} & (x_i \in x_\alpha) \\ (x_{\alpha\beta})_i / r_{\alpha\beta} & (x_i \in x_\beta) \\ 0 & \text{otherwise} \end{cases} , \tag{3.5.6} \]

where $(x_{\alpha\beta})_i$ denotes the ith component of the vector.

Figure 3.5: The conventionally invoked valence coordinates. (a) corresponds to bond stretching, (b) corresponds to valence angle bending, and (c) corresponds to dihedral torsion. The major components that will be required in the following discussion are shown.
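The bond-stretch derivatives of (3.5.6) translate directly into code. The function name is hypothetical; the sketch returns the bond length together with the gradient blocks for the two bonded atoms, all other atoms having zero gradient.

```python
import numpy as np

def bond_length_and_gradient(x_alpha, x_beta):
    """Bond length and its Cartesian gradients for atoms alpha and beta."""
    # Bond vector directed from alpha to beta.
    x_ab = np.asarray(x_beta, dtype=float) - np.asarray(x_alpha, dtype=float)
    r_ab = np.linalg.norm(x_ab)
    unit = x_ab / r_ab
    # Displacing beta along +unit, or alpha along -unit, lengthens the bond.
    return r_ab, -unit, unit
```

The two gradient blocks are equal and opposite, reflecting the fact that a rigid translation of both atoms leaves the bond length unchanged.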

Valence Angles

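Take three atomic position vectors, $x_\alpha$, $x_\beta$ and $x_\gamma$. This system comprises two bond vectors, $x_{\gamma\alpha}$ and $x_{\gamma\beta}$, directed from the apex atom γ to the two terminal atoms α and β, respectively. As before, the bond lengths of these bond vectors are given by $r_{\gamma\alpha} = \lVert x_{\gamma\alpha} \rVert$ and $r_{\gamma\beta} = \lVert x_{\gamma\beta} \rVert$. The valence angle RIC, $q_{angle}^{\alpha\beta,\gamma}$, is the angle subtended by $x_{\gamma\alpha}$ and $x_{\gamma\beta}$ at $x_\gamma$, and is defined as

\[ q_{angle}^{\alpha\beta,\gamma} = \arccos\left( \frac{x_{\gamma\alpha} \cdot x_{\gamma\beta}}{r_{\gamma\alpha} r_{\gamma\beta}} \right) . \tag{3.5.7} \]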

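The derivatives of $q_{angle}^{\alpha\beta,\gamma}$ with respect to the components of one of the terminal atoms, say $x_\alpha$ for argument, are given by the components of the unit vector that maximises $q_{angle}^{\alpha\beta,\gamma}$ when displacing α along the unit vector, divided by the bond length $r_{\gamma\alpha}$. By inspection, we see that this unit vector is perpendicular to $x_{\gamma\alpha}$ in the plane of the page and directed away from β, and can be expressed as the double cross product $x_{\gamma\alpha} \times (x_{\gamma\alpha} \times x_{\gamma\beta})$. Then, normalising this quantity,

\[ \frac{\partial q_{angle}^{\alpha\beta,\gamma}}{\partial x_i} = \begin{cases} \dfrac{\left[ x_{\gamma\alpha} \times (x_{\gamma\alpha} \times x_{\gamma\beta}) \right]_i}{r_{\gamma\alpha} \lVert x_{\gamma\alpha} \times (x_{\gamma\alpha} \times x_{\gamma\beta}) \rVert} & (x_i \in x_\alpha) \\[2ex] \dfrac{\left[ x_{\gamma\beta} \times (x_{\gamma\beta} \times x_{\gamma\alpha}) \right]_i}{r_{\gamma\beta} \lVert x_{\gamma\beta} \times (x_{\gamma\beta} \times x_{\gamma\alpha}) \rVert} & (x_i \in x_\beta) \end{cases} , \tag{3.5.8} \]

where the expression for $x_i \in x_\beta$ follows by symmetry. A rigid translation of the molecule does nothing to alter the valence angle. Suppose we shift the apex atom γ an infinitesimal amount in one direction, and subsequently shift the entire molecule by an opposite amount; the change in valence angle can then be expressed in terms of the sum of contributions from the shift in the terminal atoms with reversed sign, i.e.

\[ \frac{\partial q_{angle}^{\alpha\beta,\gamma}}{\partial x_i} = - \frac{\left[ x_{\gamma\alpha} \times (x_{\gamma\alpha} \times x_{\gamma\beta}) \right]_i}{r_{\gamma\alpha} \lVert x_{\gamma\alpha} \times (x_{\gamma\alpha} \times x_{\gamma\beta}) \rVert} - \frac{\left[ x_{\gamma\beta} \times (x_{\gamma\beta} \times x_{\gamma\alpha}) \right]_i}{r_{\gamma\beta} \lVert x_{\gamma\beta} \times (x_{\gamma\beta} \times x_{\gamma\alpha}) \rVert} \qquad (x_i \in x_\gamma) . \tag{3.5.9} \]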
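The valence angle derivatives can be sketched and verified against finite differences. The function name is hypothetical, and the double cross product is written here with explicit grouping and normalisation, $x_{\gamma\alpha} \times (x_{\gamma\alpha} \times x_{\gamma\beta})$, so that the gradient on a terminal atom points away from the other terminal atom; the sketch is undefined for a linear arrangement, where the cross product vanishes.

```python
import numpy as np

def angle_and_gradients(x_alpha, x_beta, x_gamma):
    """Valence angle at apex gamma and its Cartesian gradients."""
    u = np.asarray(x_alpha, dtype=float) - np.asarray(x_gamma, dtype=float)
    v = np.asarray(x_beta, dtype=float) - np.asarray(x_gamma, dtype=float)
    ru, rv = np.linalg.norm(u), np.linalg.norm(v)
    angle = np.arccos(np.clip(np.dot(u, v) / (ru * rv), -1.0, 1.0))
    # In-plane directions perpendicular to each bond that open the angle.
    d_alpha = np.cross(u, np.cross(u, v))
    d_beta = np.cross(v, np.cross(v, u))
    grad_alpha = d_alpha / (np.linalg.norm(d_alpha) * ru)
    grad_beta = d_beta / (np.linalg.norm(d_beta) * rv)
    # Translational invariance fixes the apex gradient.
    grad_gamma = -(grad_alpha + grad_beta)
    return angle, grad_alpha, grad_beta, grad_gamma
```

The magnitude of each terminal gradient is the reciprocal of the corresponding bond length, since an infinitesimal perpendicular displacement of a terminal atom changes the angle by that displacement divided by the bond length.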

Torsion and Additional Valence Coordinates

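Take four atomic position vectors, $x_\alpha$, $x_\beta$, $x_\gamma$ and $x_\delta$. The system comprises three bond vectors, $x_{\alpha\beta}$, $x_{\gamma\delta}$ and $x_{\beta\gamma}$, and two valence angles, $q_{angle}^{\alpha\gamma,\beta}$ and $q_{angle}^{\delta\beta,\gamma}$. The torsional angle, $q_{torsion}^{\alpha\beta\gamma\delta}$, is defined as the angle between the planes spanned by $\{x_{\alpha\beta}, x_{\beta\gamma}\}$ and $\{x_{\gamma\delta}, x_{\beta\gamma}\}$, as observed along the $x_{\beta\gamma}$ bond vector. Analytically, we can express $q_{torsion}^{\alpha\beta\gamma\delta}$ as

\[ q_{torsion}^{\alpha\beta\gamma\delta} = \arccos\left( \frac{(x_{\alpha\beta} \times x_{\beta\gamma}) \cdot (x_{\beta\gamma} \times x_{\gamma\delta})}{r_{\alpha\beta}\, r_{\gamma\delta}\, r_{\beta\gamma}^2 \sin(q_{angle}^{\alpha\gamma,\beta}) \sin(q_{angle}^{\delta\beta,\gamma})} \right) . \tag{3.5.10} \]

It is hoped that the author will be forgiven for not presenting a formal derivation of the derivatives of this quantity with respect to the Cartesian degrees of freedom of the system. The derivatives can be obtained by the same methodology employed in the previous sections, and so are simply presented

\[ \frac{\partial q_{torsion}^{\alpha\beta\gamma\delta}}{\partial x_i} = \begin{cases} -\dfrac{(x_{\alpha\beta} \times x_{\beta\gamma})_i}{r_{\alpha\beta}^2\, r_{\beta\gamma} \sin^2(q_{angle}^{\alpha\gamma,\beta})} & (x_i \in x_\alpha) \\[2ex] \dfrac{(x_{\alpha\beta} \times x_{\beta\gamma})_i \left[ r_{\beta\gamma} - r_{\alpha\beta} \cos(q_{angle}^{\alpha\gamma,\beta}) \right]}{r_{\beta\gamma}^2\, r_{\alpha\beta}^2 \sin^2(q_{angle}^{\alpha\gamma,\beta})} - \dfrac{(x_{\delta\gamma} \times x_{\gamma\beta})_i}{r_{\beta\gamma}^2\, r_{\gamma\delta} \tan(q_{angle}^{\delta\beta,\gamma}) \sin(q_{angle}^{\delta\beta,\gamma})} & (x_i \in x_\beta) \\[2ex] P_{\alpha\delta} P_{\beta\gamma} \left[ \text{expression for } x_i \in x_\beta \right] & (x_i \in x_\gamma) \\[1ex] P_{\alpha\delta} P_{\beta\gamma} \left[ \text{expression for } x_i \in x_\alpha \right] & (x_i \in x_\delta) \end{cases} , \tag{3.5.11} \]

where we have introduced the permutation operator, $P_{ab}$, which permutes the indices of the function it operates on. In the above case, the permutation operator acts on either the expression for $(x_i \in x_\beta)$ or $(x_i \in x_\alpha)$, the explicit forms of which have been given.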
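The torsional angle of (3.5.10) can be evaluated as sketched below. The function name is hypothetical. The denominator of (3.5.10) equals the product of the norms of the two cross products, which is the form used here; as with any arccos definition, the result is unsigned and lies between 0 and π.

```python
import numpy as np

def torsion_angle(x_alpha, x_beta, x_gamma, x_delta):
    """Unsigned torsional angle (radians) following the arccos form of (3.5.10)."""
    xa, xb, xg, xd = (np.asarray(p, dtype=float)
                      for p in (x_alpha, x_beta, x_gamma, x_delta))
    x_ab, x_bg, x_gd = xb - xa, xg - xb, xd - xg
    n1 = np.cross(x_ab, x_bg)  # normal to the plane {x_ab, x_bg}
    n2 = np.cross(x_bg, x_gd)  # normal to the plane {x_bg, x_gd}
    # |n1| |n2| = r_ab r_gd r_bg^2 sin(angle at beta) sin(angle at gamma).
    cos_t = np.dot(n1, n2) / (np.linalg.norm(n1) * np.linalg.norm(n2))
    return np.arccos(np.clip(cos_t, -1.0, 1.0))
```

With the central bond along the x axis, placing the two terminal atoms on the same side gives a cis arrangement (zero torsion), opposite sides give a trans arrangement (π), and perpendicular placement gives π/2.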

A number of additional valence coordinates can be defined, such as the four-body angle between a bond and a plane (out-of-plane RIC)[270]. Since, by definition, there is no upper bound on the number of RICs employed to define a molecular system, one can introduce as many valence coordinates as desired. However, we note that for geometry optimisation, simply adding more RICs results in more coupling between the degrees of freedom, and so the Hessian loses its sparse characteristic.

3.5.3 Implementation

The normal modes of motion with which we evolve the system are inconvenient if we wish to direct our sampling to physically meaningful regions of conformational space. For example, if we are able to identify that we require a rigorous sampling of a certain internal coordinate, we are unable to predominantly sample that degree of freedom in a normal mode basis.

However, if we are able to decompose the normal modes of a system into a valence coordinate basis, we would be able to select those normal modes that are dominated by certain physically meaningful motions, and direct our sampling by selectively evolving those normal modes.

Decomposing the normal modes of a system into contributions from valence coordinates is quite a simple process. Recall in Section 3.4.1 we introduced the quantity MDE as the Cartesian displacements associated with each normal mode. Since a Cartesian state vector transforms to a RIC basis through (3.5.3), we simply require evaluation of the matrix multiplication BMDE to transform the normal modes into a valence coordinate basis.

Demonstrating the results of this computation, we recall that the three normal modes of water are the “symmetric stretch”, “asymmetric stretch” and “valence angle bending” modes of motion. The first two involve only bond stretching degrees of freedom, while the third involves predominantly the angle bending degree of freedom, along with a small amount of bond stretching. Figure 3.6 shows the output of tyche concerning the three normal modes of water.

Figure 3.6: tyche output outlining the normal mode information of water. Normal modes 1, 2 and 3 correspond to the valence angle bending, symmetric stretching and asymmetric stretching modes, respectively. The RIC contributions to each normal mode are given at the bottom of the figure.

The section labelled “Normal Coordinates” reveals the Cartesian displacements associated with each normal mode, as computed through MDE. The section labelled “Redundant Internal Coordinates” denotes the valence coordinate contributions to each normal mode. For the low frequency mode, we find that 89% of the motion corresponds to the valence angle bending, while the remaining 11% results from bond stretching. For the two higher frequency modes, we find that the overwhelming majority of motion derives from bond stretching.

This methodology is applicable to any system, allowing us to propose a simple means by which we can selectively sample regions of valence coordinate space. The tyche input has two keywords: “WhichModes” and “MaxModes”. “WhichModes” can take three separate values: “Bonds”, “Angles” and “Dihedrals”. When one of these options is selected, tyche will take the “MaxModes” normal modes with the largest average contribution from the valence coordinates specified by “WhichModes”. For the sake of demonstration, if the user sets “MaxModes = 10” and “WhichModes = Dihedrals”, tyche will sample the 10 normal modes with the highest contribution from dihedral valence coordinates.
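The decomposition and the mode selection can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, the columns of the MDE matrix are taken to be the normal modes, and the text does not specify how the percentage contributions are normalised, so the sketch simply normalises the absolute RIC displacements of each mode to sum to one.

```python
import numpy as np

def ric_contributions(B, mde):
    """Fractional RIC contributions to each normal mode (columns sum to one)."""
    # Transform the Cartesian mode displacements to the RIC basis via B.
    ric_displacements = np.abs(np.asarray(B) @ np.asarray(mde))
    return ric_displacements / ric_displacements.sum(axis=0, keepdims=True)

def select_modes(contributions, ric_types, which_modes, max_modes):
    """Indices of the max_modes modes with the largest mean contribution
    from RICs of the requested type (e.g. 'Bonds', 'Angles', 'Dihedrals')."""
    mask = np.array([t == which_modes for t in ric_types])
    score = contributions[mask].mean(axis=0)
    order = np.argsort(score)[::-1]  # descending by contribution
    return order[:max_modes]

# Toy example: three RICs (two bonds, one angle) and three modes.
B = np.eye(3)
mde = np.array([[1.0, 1.0, 0.1],
                [1.0, -1.0, 0.1],
                [0.05, 0.0, 2.0]])
types = ["Bonds", "Bonds", "Angles"]
contrib = ric_contributions(B, mde)
angle_modes = select_modes(contrib, types, "Angles", 1)
bond_modes = select_modes(contrib, types, "Bonds", 2)
```

In this toy example the third mode is dominated by the angle coordinate and is therefore the one selected for “Angles”, while the first two modes are selected for “Bonds”.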

One can envisage a number of ways in which tyche can sample using this information. For instance, it could bias the total energy into a subset of the normal modes, while still putting some energy into the remaining normal modes. Currently, tyche will only vibrate “MaxModes” normal modes. We demonstrate the power of this methodology in Section 3.6.3.

3.6 Results

3.6.1 Validation and Benchmarking

In the following, we present a systematic parameterisation of those free variables we have alluded to in previous sections. We have attempted to make this section as concise as possible, and so necessarily omit the results of auxiliary experiments, leaving only those results of significance.

Parameterisation of nperiod

Our first benchmarking study assesses the effects of choice of nperiod. Our aim is to obtain as much conformational sampling as possible, and so we have chosen to determine the optimum choice of nperiod to be that which maximises the degree of conformational sampling. In Tables 3.2, 3.3 and 3.4, we present the range and standard deviations of a prominent degree of freedom for water (valence angle), NMA (peptide dihedral) and zwitterionic histidine (sidechain dihedral), where the chosen degrees of freedom are specified in the respective parentheses. The CPU time required to obtain conformational ensembles comprising 4000 conformers is also presented. The Jacobian and Hessian have been calculated at the B3LYP/aug-cc-pVDZ level of theory. We have used a temperature of 1750 K for the sampling, which seems excessive, but is in fact arbitrary for our purposes, and simply amplifies the results obtained at lower temperatures.

nperiod     Range (degrees)   Standard Deviation (degrees)   CPU Time (seconds)
1 × 10^1    34.3              5.3                            0.116
1 × 10^2    32.9              5.3                            0.204
1 × 10^3    33.3              5.3                            1.243
1 × 10^4    33.3              5.3                            11.505
1 × 10^5    33.3              5.3                            113.318

Table 3.2: Conformational sampling of water at a number of values of nperiod.

nperiod     Range (degrees)   Standard Deviation (degrees)   CPU Time (seconds)
1 × 10^1    108.6             19.3                           0.807
1 × 10^2    107.5             19.3                           1.899
1 × 10^3    106.9             19.3                           12.469
1 × 10^4    106.9             19.3                           114.530
1 × 10^5    106.9             19.3                           1152.896

Table 3.3: Conformational sampling of NMA at a number of values of nperiod.

nperiod     Range (degrees)   Standard Deviation (degrees)   CPU Time (seconds)
1 × 10^1    56.8              5.6                            3.416
1 × 10^2    52.0              5.6                            6.081
1 × 10^3    52.1              5.6                            33.423
1 × 10^4    52.1              5.6                            298.152
1 × 10^5    52.2              5.6                            2867.963

Table 3.4: Conformational sampling of histidine at a number of values of nperiod.

Our results are conclusive, in that the value of nperiod is largely irrelevant, in both its magnitude and the system in which it is used. In fact, the range of conformational sampling appears to decrease slightly as nperiod increases. This is a particularly useful trend, since the time required to generate the conformational ensemble increases linearly in nperiod, and so becomes prohibitive for large systems.

Therefore, it seems advisable to set nperiod to some suitably low value. For the remainder of our work, we will use nperiod = 10.

Level of Theory

Next, we investigate the effects of the level of theory with which the Jacobian and Hessian are computed. Once again, our criterion for selecting the optimal level of theory is the degree of conformational sampling we are able to obtain with each level of theory. This study could be all the more exhaustive by investigating level of theory/basis set combinations, but we feel this to be superfluous. Our systems of choice are as above (water, NMA and zwitterionic histidine), where the same degrees of freedom are evaluated. We have used the aug-cc-pVDZ basis set for each test case, and once again a temperature of 1750 K.

Level of Theory   Range (degrees)   Standard Deviation (degrees)
Hartree Fock      30.9              4.7
B3LYP             34.3              5.3
M06               34.4              5.3
MP2               34.2              5.3

Table 3.5: Conformational sampling of water using the Jacobian and Hessian from a number of different levels of theory.

Level of Theory   Range (degrees)   Standard Deviation (degrees)
Hartree Fock      113.2             22.6
B3LYP             128.8             25.1
M06               118.9             24.7
MP2               116.2             21.7

Table 3.6: Conformational sampling of NMA using the Jacobian and Hessian from a number of different levels of theory.

Level of Theory   Range (degrees)   Standard Deviation (degrees)
Hartree Fock      32.9              3.6
B3LYP             56.8              5.6
M06               38.2              4.3
MP2               62.0              6.9

Table 3.7: Conformational sampling of histidine using the Jacobian and Hessian from a number of different levels of theory.

Computing the Jacobian and Hessian at the Hartree Fock level of theory seems to quite severely limit the amount of conformational space available to the systems. In water, only the lowest frequency mode is able to sample the valence angle degree of freedom, so by looking at the force constant of this mode across the levels of theory, we are able to see the reason why Hartree Fock is so restrictive. The force constants associated with the valence angle mode are 2.10, 1.70, 1.68 and 1.68 mDyne Å⁻¹ for the HF, B3LYP, M06 and MP2 levels of theory, respectively. The higher Hartree Fock force constant restricts sampling of the valence angle mode given the same amount of energy, and so conformational flexibility is reduced.

It seems as if B3LYP allows for the greatest conformational flexibility across the levels of theory investigated, certainly with regards to the NMA peptide dihedral. Both M06 and MP2 perform relatively well, but appear to be inconsistent in the amount of conformational flexibility that they allow. As such, we recommend the use of the B3LYP level of theory if the aim is to obtain as diverse a conformational sampling as possible.

Temperature

Our final parameterisation involves ascertaining the temperature at which we can construct a conformational ensemble spanning the relevant regions of conformational space. An approximation as to what constitutes the relevant regions of conformational space is obtained by benchmarking against the conformational ensembles used in the construction of alternative force fields. A particularly useful set of papers to this effect presents a number of statistics regarding the conformational ensembles used to parameterise a generic force field for use with alkanes[271], organic compounds[272], peptides[273], carbohydrates[274] and ionic species[275]. The conformational ensembles in these studies have been constructed by a systematic sampling along each normal mode of motion for given molecules in their global minimum energy conformations.

In the following discussion, we make reference to the “temperature”, which can be ambiguous since its invocation can be used in the context of an MD trajectory or the temperature used in tyche as defined in Section 3.4.2. When discussing the temperature at which an MD trajectory has been computed, we explicitly write “MD temperature”, or a phrase to that effect. If the word temperature is used without any further clarification, the sampling temperature used for tyche is implied.

We immediately reject the notion that using a temperature equal to that at which the desired MD trajectory is to be computed is adequate for the construction of a valid conformational ensemble. Our approximation of the LW necessitates a revised attitude to our use of temperature. The amount of energy⁴ required to distort a system from the seeding geometry increases quadratically in the distortion. Escape from the LW is therefore never permitted, whilst in reality, after a short displacement from the seeding geometry, the potential energy need not increase monotonically. Unless the seeding geometry lies within a particularly deep well on the true PES, it is highly likely that a true local exploration of conformational space about the seeding geometry is equivalent to a high temperature exploration of the LW.
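A consequence of the quadratic energy dependence is that the width of a sampled range should grow roughly as the square root of the sampling temperature. The sketch below assumes a single harmonic mode with energy proportional to temperature; `scaled_width` is a hypothetical helper, not part of tyche. As a check against the source data, the O1-C3-O2 width in Table 3.9 is 6.7° at 300 K and 13.1° at 1200 K, against a square-root prediction of 13.4°.

```python
import math


def scaled_width(width_ref, t_ref, t):
    """Scale a sampling-range width from temperature t_ref to t.

    Assumes E = (1/2) k x^2 with E proportional to T, so the accessible
    displacement (and hence the range width) grows as sqrt(T).
    """
    return width_ref * math.sqrt(t / t_ref)


# O1-C3-O2 width at 300 K from Table 3.9: 131.5 - 124.8 = 6.7 degrees.
predicted_1200 = scaled_width(6.7, 300.0, 1200.0)
print(f"Predicted width at 1200 K: {predicted_1200:.1f} degrees")
```

The square-root dependence also makes clear why large increases in temperature are needed to extend a range appreciably within a single LW.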

In the following, we benchmark against three separate sets of data. The first two sets of data derive from the aforementioned references involving the parameterisation of a generic force field for use with alkanes[271] and peptides[273]. Both of these papers use seventeen distinct molecules for sampling, but we have chosen four from each, representing a cross-section of the types of molecules studied. For the alkanes, we have selected methane, ethane, neopentane and cyclopentane, while for the peptides we have selected formamide, N-methylacetamide, methylazacyclopropanone and N,N-dimethylformamide. The third set of data is an MD trajectory for zwitterionic alanine that we have computed using the MD package in conjunction with the AMBER99SB force field. The MD trajectory has been computed at 300 K in the NPT ensemble with a Parrinello-Rahman barostat. We are brief on these details since they are identical to those of the MD trajectories computed in Section 6.4.1, to which the reader is directed for a thorough description.

⁴ The notions of “energy” and “temperature” are interchangeable since they are linearly proportional. Our discussion is invariant with regard to the proportionality constant linking these two quantities.

We have generated conformational ensembles using tyche at a number of temperatures. For each computation, we have used nperiod = 10 and computed the Hessian and Jacobian at the B3LYP/aug-cc-pVDZ level of theory. We use only the global minimum as a seeding geometry for the molecules enumerated above. For this reason, we only benchmark against bond stretching and valence angle bending degrees of freedom since the torsional degrees of freedom are only properly sampled by seeding with a number of distinct molecular geometries (which we tackle in Section 3.6.2). We have used 500 conformers in the construction of each conformational ensemble obtained through tyche.

The resultant conformational ensemble computed using tyche is compared to that obtained from the previously discussed sources. We have decided to use the range of sampling of a number of selected degrees of freedom for benchmarking. For example, suppose a system possesses a number of similar bonds (e.g. CH, NH, etc.). The range of this degree of freedom is then defined by the maximum and minimum sampled values of that bond length across all similar bonds within the system. It is worth noting that our goal is not to faithfully recreate the ranges of sampling in references [271, 273] and the zwitterionic alanine MD trajectory. Instead, we see these as minimum requirements for sampling, and indicative of the relevant ranges of values taken by those degrees of freedom.
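The pooled range defined above can be sketched as follows. The data layout (one dictionary of labelled bond lengths per conformer), the function name `pooled_range` and the numerical values are all assumptions made for illustration; they are not taken from tyche or from the referenced ensembles.

```python
def pooled_range(conformers, bonds):
    """Return (min, max) of a bond length pooled over similar bonds.

    conformers: list of dicts mapping a bond label to its length in one conformer.
    bonds: labels of the bonds treated as one degree of freedom (e.g. all CH).
    """
    values = [conformer[bond] for conformer in conformers for bond in bonds]
    return min(values), max(values)


# Hypothetical CH bond lengths (Å) for two conformers of a fragment with two CH bonds.
ensemble = [
    {"C1-H1": 1.09, "C1-H2": 1.12},
    {"C1-H1": 1.05, "C1-H2": 1.10},
]
ch_range = pooled_range(ensemble, ["C1-H1", "C1-H2"])
print(ch_range)
```

Pooling over all similar bonds means a single strongly distorted bond in one conformer sets the bound for the whole degree of freedom, which is why the ranges are best read as envelopes.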

For the alkanes, the degrees of freedom against which we benchmark are the CH bond stretching and HCH valence angle bending degrees of freedom. Alkane sampling is summarised in Figures 3.7 and 3.8. These diagrams show the range of sampling as obtained through tyche of the CH bond stretch and HCH valence angle bending degrees of freedom, respectively, for the four alkanes we have mentioned. As the temperature is increased, the range of sampling increases, as expected. Also shown in these figures are the maximum and minimum values of the corresponding degrees of freedom taken by over 99% of conformers in reference [271]. We see that for the CH bond stretch, we sample roughly the same range of bond lengths as the referenced work at a temperature of 900 K. Similarly, for the HCH valence angle bend, we sample roughly the same range of valence angles as the referenced work at a temperature of 900 K.

Moving on to the peptides, the degrees of freedom against which we benchmark are the CN bond stretching and OCN valence angle bending degrees of freedom. Peptide sampling is outlined in Figures 3.9 and 3.10. These diagrams show the range of sampling as obtained through tyche of the CN bond stretch and OCN valence angle bending degrees of freedom, respectively, for the four peptides we have mentioned.

Figure 3.7: Range of CH bond stretch degrees of freedom in methane (green), ethane (red), neopentane (yellow) and cyclopentane (blue) as the temperature is increased. The black lines correspond to the range of sampling undertaken in Maple et al.

[Figure 3.7 plot: Bond Length (Angstroms) against Temperature (Kelvin), 300 to 1800 K; series: Methane, Ethane, Neopentane, Cyclopentane.]

Figure 3.8: Range of HCH valence angle degrees of freedom in methane (green), ethane (red), neopentane (yellow) and cyclopentane (blue) as the temperature is increased. The black lines correspond to the range of sampling undertaken in Maple et al.

[Figure 3.8 plot: Valence Angle (Degrees) against Temperature (Kelvin), 300 to 1800 K; series: Methane, Ethane, Neopentane, Cyclopentane.]

Figure 3.9: Range of CN bond stretch degrees of freedom in formamide (green), N-methylacetamide (red), methylazocyclopropanone (yellow) and N,N-dimethylformamide (blue) as the temperature is increased. The black lines correspond to the range of sampling undertaken in Maple et al.

[Figure 3.9 plot: Bond Length (Angstroms) against Temperature (Kelvin), 300 to 1800 K; series: Formamide, N-methylacetamide, Methylazocyclopropanone, Dimethylformamide.]

Figure 3.10: Range of OCN valence angle degrees of freedom in formamide (green), N-methylacetamide (red), methylazocyclopropanone (yellow) and N,N-dimethylformamide (blue) as the temperature is increased. The black lines correspond to the range of sampling undertaken in Maple et al.

[Figure 3.10 plot: Valence Angle (Degrees) against Temperature (Kelvin), 300 to 1800 K; series: Formamide, N-methylacetamide, Methylazocyclopropanone, Dimethylformamide.]

From Figure 3.9, it is evident that tyche vastly oversamples the higher end of the CN bond length range relative to reference [273], particularly for the methylazocyclopropanone species. In contrast, the lower end of the CN bond length range is not sampled at all through tyche. A similar result is seen in Figure 3.10, where we only sample the full range obtained in reference [273] by 1800 K. Even at 1800 K, half of the range is sampled by a single molecular species, methylazocyclopropanone, and so our sampling can hardly be considered successful.

We note that the sum of the covalent radii of carbon and nitrogen is roughly 1.4 Å. However, the range of sampling claimed in reference [273] for the CN bond length is 1.22 − 1.51 Å. To claim that the range of sampling of this degree of freedom is bounded from below by a value that is roughly 20% smaller than the sum of the atomic covalent radii seems spurious. Indeed, from the MD trajectory of zwitterionic alanine presented later in this section, the sampling range of the CN bond length is roughly 1.35 − 1.60 Å, almost perfectly in keeping with the results that we have obtained in Figure 3.9 at a temperature of roughly 600 K. Similarly, the equilibrium peptide bond angle OCN is roughly 120° in the majority of systems, whereas the range of sampling claimed in reference [273] for this degree of freedom is 114 − 146°. Once more, claiming that the range of sampling of this degree of freedom is bounded from above by a value more than 20% greater than the equilibrium bond angle seems excessive.

We stop short of claiming the data given in [273] is incorrect. Analysis of Figure 3.10 sheds a little more light on the discrepancies we have encountered. For example, two groups of peptide species sample completely opposite ends of the sampling range; formamide, N-methylacetamide and N,N-dimethylformamide sample the low end of the range, while methylazocyclopropanone samples the other, with virtually no overlap between the two groups. This result can be immediately accounted for since the nitrogen in methylazocyclopropanone is situated in a three-membered ring, and thus its motion is severely constrained relative to the other molecules. Indeed, the three-membered ring sterically clashes with the ketone portion of the molecule when the OCN valence angle is 120°, and so the equilibrium OCN valence angle is more obtuse.⁵

In [273], a sample set of seventeen distinct molecules is used. We therefore presume that, given the full sampling of each of these molecules, one could sample the full range of the CN bond stretching and OCN valence angle bending degrees of freedom reported. However, since we only consider a small subset of the molecules used, it stands to reason that we are not able to sample the full range reported. Unfortunately, without an analysis of the seventeen molecules used in [273], it is difficult to be conclusive about the temperature that should be used for sampling with tyche.

⁵ One could question why this discrepancy is not evident with the alkanes we have analysed. In response, we indicate that alkanes are typically very similar, and are not amenable to adopting conformations that disrupt the tetrahedral bonding pattern of carbon.

Figure 3.11: Alanine zwitterion atomic labelling.

Finally, we use an in vacuo MD trajectory of zwitterionic alanine (see Figure 3.11 for an overview of the atomic labelling scheme used herein) to parameterise the temperature used for conformational sampling. We have selected a number of prominent degrees of freedom whose sampling ranges we will benchmark against. The MD trajectory was analysed with the in-house software hermes[276]. The bond stretching degrees of freedom that we use are N1-H2, C1-C2, N1-C1 and C3-O1. The valence angle bending degrees of freedom that we use are O1-C3-O2, H2-N1-H3, N1-C1-C3 and H5-C2-H6. The sampling ranges of these degrees of freedom are presented in Tables 3.8 and 3.9 as the sampling temperature is increased. Also provided are the ranges taken by these degrees of freedom over the course of the MD trajectory.

Sampling Range (Å)

| Temperature (K) | N1-H2 | C1-C2 | N1-C1 | C3-O1 |
|---|---|---|---|---|
| 300 | [1.08 − 0.98] | [1.59 − 1.48] | [1.56 − 1.47] | [1.29 − 1.24] |
| 600 | [1.10 − 0.97] | [1.62 − 1.47] | [1.58 − 1.46] | [1.30 − 1.23] |
| 900 | [1.13 − 0.95] | [1.64 − 1.45] | [1.60 − 1.44] | [1.31 − 1.22] |
| 1200 | [1.17 − 0.94] | [1.65 − 1.44] | [1.61 − 1.44] | [1.31 − 1.21] |
| 1500 | [1.20 − 0.94] | [1.67 − 1.43] | [1.63 − 1.43] | [1.32 − 1.21] |
| 1800 | [1.24 − 0.93] | [1.68 − 1.42] | [1.64 − 1.42] | [1.32 − 1.20] |
| MD | [1.15 − 0.92] | [1.65 − 1.43] | [1.63 − 1.35] | [1.35 − 1.19] |
| MD (98% of conformers) | [1.11 − 0.96] | [1.61 − 1.45] | [1.55 − 1.43] | [1.30 − 1.21] |

Table 3.8: Sampling ranges of four prominent bond stretching degrees of freedom in zwitterionic alanine as the sampling temperature is increased. The ab initio Hessian and Jacobian are taken from gas phase frequency calculations. In the bottom row, we include the ranges of these degrees of freedom as obtained from a 300 K in vacuo MD trajectory. In red, we have refined these sampling ranges down so that they represent the sampling ranges taken by over 98% of conformers over the course of the MD trajectory.

We see from these tables that we recover the upper bound on the sampling range for the majority of the degrees of freedom by 900-1200 K, but recover the lower bound, if at all, only at considerably higher temperatures, and in several cases not even by 1800 K. However, we have also presented the ranges of sampling taken by over 98% of the conformational ensemble from the MD trajectory, and we find these ranges to be in much better agreement with our results. Indeed, the majority of ranges are recovered by a temperature of roughly 900 K, which is entirely in keeping with the results we obtained for the alkanes.
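The restriction of an MD range to the bulk of the ensemble can be sketched as below. How the 98% ranges were actually extracted is not stated, so the symmetric trimming of both tails used here is an assumption, as are the function name `central_range` and the synthetic bond lengths, which are merely chosen to resemble the N1-H2 row of Table 3.8.

```python
def central_range(values, fraction=0.98):
    """Range left after trimming both tails equally.

    An equal number of samples is dropped from each end of the sorted data,
    chosen so that at least `fraction` of the samples remain inside the
    returned (low, high) pair.
    """
    ordered = sorted(values)
    n = len(ordered)
    # Samples that may be discarded in total, split evenly between the tails.
    drop_each = int(n * (1.0 - fraction) / 2.0)
    return ordered[drop_each], ordered[n - 1 - drop_each]


# Synthetic data: 98 values spread over 0.96 to 1.11 Å plus two outliers.
lengths = [0.92, 1.15] + [0.96 + 0.15 * i / 97 for i in range(98)]
full = (min(lengths), max(lengths))
bulk = central_range(lengths)
print(full, bulk)
```

In this synthetic case two outlying conformers out of one hundred are enough to widen the full range from roughly 0.15 Å to 0.23 Å, which illustrates how sensitive the untrimmed MD bounds are to rare events.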

We propose that the reason for the full MD sample spanning a larger range than that obtained by tyche is that the molecule is permitted to undergo large conformational changes over the course of an MD trajectory. As we have mentioned, tyche is far more restrictive when seeded with a single conformer, and so does not permit such large-scale conformational changes. Allowing the system to adopt conformations that differ significantly from the seeding geometry results in a number of atomic interactions not present at the seeding geometry. For instance, a steric clash in one conformation may weaken an interatomic bond in the molecule, and grant the bond a great deal more flexibility (or restrict the bond further), and subsequently lead to an entirely different sampling range. Alternatively, an intramolecular hydrogen bond in one conformation may restrict a number of degrees of freedom relative to the conformer with no intramolecular hydrogen bond.

Sampling Range (degrees)

| Temperature (K) | O1-C3-O2 | H2-N1-H3 | N1-C1-C3 | H5-C2-H6 |
|---|---|---|---|---|
| 300 | [131.5 − 124.8] | [120.2 − 99.9] | [115.8 − 107.7] | [116.8 − 101.1] |
| 600 | [132.7 − 123.3] | [125.2 − 96.6] | [117.5 − 106.1] | [120.2 − 98.2] |
| 900 | [133.6 − 122.1] | [128.9 − 94.0] | [118.9 − 104.9] | [122.8 − 96.0] |
| 1200 | [134.3 − 121.2] | [132.1 − 91.9] | [120.1 − 103.9] | [124.9 − 94.1] |
| 1500 | [135.0 − 120.3] | [134.8 − 90.0] | [121.1 − 103.0] | [126.7 − 92.4] |
| 1800 | [135.6 − 119.6] | [137.2 − 88.4] | [122.0 − 102.2] | [128.4 − 90.9] |
| MD | [134.3 − 115.8] | [128.3 − 90.2] | [117.6 − 97.3] | [126.2 − 92.8] |
| MD (98% of conformers) | [131.0 − 120.6] | [123.2 − 96.0] | [117.3 − 103.4] | [121.3 − 95.4] |

Table 3.9: Sampling ranges of four prominent valence angle bending degrees of freedom in zwitterionic alanine as the sampling temperature is increased. The ab initio Hessian and Jacobian are taken from gas phase frequency calculations. In the bottom row, we include the ranges of these degrees of freedom as obtained from a 300 K in vacuo MD trajectory. In red, we have refined these sampling ranges down so that they represent the sampling ranges taken by over 98% of conformers over the course of the MD trajectory.

To address this issue, we require the use of a number of seeding geometries to permit the exploration of a greater extent of conformational space. This will be undertaken in Section 3.6.2. However, 900 K appears to be a sensible choice of temperature to reproduce the vast majority of sampling ranges we have encountered across a wide variety of molecular systems, and so we advocate its use when sampling with tyche.

As a matter of interest, we have also acquired an explicitly solvated MD trajectory of zwitterionic alanine with TIP3P explicit water molecules. The presence of water should dampen the electrostatic interaction between atoms, and therefore alter the range of vibrations for a number of degrees of freedom. To sample a conformational ensemble that is representative of a solvated system with the tyche methodology, we would ideally require an ab initio Hessian and Jacobian from the explicitly solvated system. However, this is obviously not feasible for systems involving even a moderate number of explicit water molecules.

As such, we have computed the ab initio Hessian and Jacobian for the zwitterionic alanine embedded in a polarisable continuum implicit solvation field (ε = 80). We expect this field to dampen any interactions involving partially charged atoms. As such, we anticipate that the upper bounds to the sampling ranges of the previously sampled degrees of freedom will be shifted higher. The sampling of these degrees of freedom is presented in Tables 3.10 and 3.11.

Sampling Range (Å)

| Temperature (K) | N1-H2 | C1-C2 | N1-C1 | C3-O1 |
|---|---|---|---|---|
| 300 | [1.09 − 0.98] | [1.58 − 1.49] | [1.56 − 1.48] | [1.30 − 1.25] |
| 600 | [1.15 − 0.97] | [1.60 − 1.48] | [1.58 − 1.47] | [1.32 − 1.24] |
| 900 | [1.21 − 0.95] | [1.62 − 1.47] | [1.60 − 1.46] | [1.33 − 1.23] |
| 1200 | [1.27 − 0.94] | [1.64 − 1.46] | [1.61 − 1.45] | [1.34 − 1.22] |
| 1500 | [1.32 − 0.93] | [1.65 − 1.45] | [1.62 − 1.44] | [1.35 − 1.22] |
| 1800 | [1.38 − 0.93] | [1.66 − 1.44] | [1.63 − 1.44] | [1.35 − 1.21] |
| MD | [1.13 − 0.97] | [1.68 − 1.40] | [1.63 − 1.36] | [1.33 − 1.19] |
| MD (98% of conformers) | [1.12 − 0.97] | [1.62 − 1.47] | [1.58 − 1.43] | [1.32 − 1.22] |

Table 3.10: Sampling ranges of four prominent bond stretching degrees of freedom in zwitterionic alanine as the sampling temperature is increased. The ab initio Hessian and Jacobian are taken from implicitly solvated (CPCM) frequency calculations. In the bottom row, we include the ranges of these degrees of freedom as obtained from a 300 K TIP3P-solvated MD trajectory. In red, we have refined these sampling ranges down so that they represent the sampling ranges taken by over 98% of conformers over the course of the MD trajectory.

Sampling Range (degrees)

| Temperature (K) | O1-C3-O2 | H2-N1-H3 | N1-C1-C3 | H5-C2-H6 |
|---|---|---|---|---|
| 300 | [130.9 − 125.9] | [117.2 − 101.2] | [115.2 − 107.1] | [116.4 − 101.0] |
| 600 | [131.8 − 124.8] | [120.9 − 98.6] | [116.7 − 105.4] | [119.5 − 98.0] |
| 900 | [132.5 − 123.9] | [123.8 − 96.7] | [117.8 − 104.0] | [121.7 − 95.8] |
| 1200 | [133.1 − 123.1] | [126.2 − 95.2] | [118.8 − 102.9] | [123.6 − 93.9] |
| 1500 | [133.7 − 122.4] | [128.3 − 93.9] | [119.6 − 101.9] | [125.1 − 92.3] |
| 1800 | [134.1 − 121.8] | [130.1 − 92.8] | [120.4 − 101.0] | [126.5 − 90.8] |
| MD | [129.4 − 112.9] | [122.6 − 93.6] | [127.1 − 100.9] | [124.7 − 91.0] |
| MD (98% of conformers) | [127.3 − 114.1] | [122.0 − 97.2] | [120.3 − 105.3] | [120.9 − 95.6] |

Table 3.11: Sampling ranges of four prominent valence angle bending degrees of freedom in zwitterionic alanine as the sampling temperature is increased. The ab initio Hessian and Jacobian are taken from implicitly solvated (CPCM) frequency calculations. In the bottom row, we include the ranges of these degrees of freedom as obtained from a 300 K TIP3P-solvated MD trajectory. In red, we have refined these sampling ranges down so that they represent the sampling ranges taken by over 98% of conformers over the course of the MD trajectory.

For a number of the degrees of freedom, we see that there is typically a poor overlap between the sampling ranges obtained from tyche and the MD trajectory until the tyche sampling temperature is set excessively high. Once again, this is improved significantly by restricting the MD sampling range to include 98% of conformers from the MD trajectory, at which point the optimal tyche sampling temperature appears to be roughly 900 K, in keeping with our previously acquired results. However, this restriction is not as successful as when implemented with the in vacuo conformers. For example, the N1-H2 bond stretching and O1-C3-O2 valence angle bending sampling ranges still do not match up with those obtained through tyche.
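One way to put a number on how well two sampling ranges "match up" is an intersection-over-union score for the two intervals. This metric is an assumption introduced here for illustration and is not the comparison used in the text; the function name `range_overlap` is likewise hypothetical. The inputs are the 900 K tyche rows and the 98% MD rows of Tables 3.10 and 3.11.

```python
def range_overlap(a, b):
    """Intersection-over-union of two closed intervals given as (low, high).

    Returns 1.0 for identical intervals and 0.0 for disjoint ones.
    """
    intersection = min(a[1], b[1]) - max(a[0], b[0])
    union = max(a[1], b[1]) - min(a[0], b[0])
    return max(intersection, 0.0) / union


# tyche at 900 K (first argument) against the 98% MD range (second argument).
c1_c2 = range_overlap((1.47, 1.62), (1.47, 1.62))
n1_h2 = range_overlap((0.95, 1.21), (0.97, 1.12))
o1_c3_o2 = range_overlap((123.9, 132.5), (114.1, 127.3))
print(f"C1-C2 {c1_c2:.2f}, N1-H2 {n1_h2:.2f}, O1-C3-O2 {o1_c3_o2:.2f}")
```

On this score C1-C2 agrees exactly at 900 K, whereas N1-H2 and O1-C3-O2 score markedly lower, consistent with the two degrees of freedom singled out above. Unlike a simple coverage test, the score also penalises oversampling beyond the reference range.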

We note that those degrees of freedom in poor agreement with the tyche conformational ensemble involve the atoms carrying the larger partial charges, i.e. the zwitterionic components of the molecule. One can assume that our inability to sample these degrees of freedom correctly results from the lack of explicit solvent interactions in the tyche sampling. In an explicitly solvated environment, these atoms predominantly form hydrogen bonds with the solvent, which subsequently impacts on the sampling ranges of the internal degrees of freedom involving these atoms. One is then unable to recreate these sampling ranges with tyche unless some explicit solvation is included. However, it is certainly encouraging that we can recover the sampling ranges of the other degrees of freedom quite readily, suggesting that tyche, parameterised with an implicitly solvated Hessian and Jacobian, is readily adaptable to solvated systems that do not interact with the solvent as strongly.

3.6.2 Markov Chain Conformational Sampling

Previous work in our group has located the local minima of each of the naturally occurring amino acids in vacuo[277]. In the referenced work, the amino acids are “capped” with peptide bonds at both the N- and C-termini. The capping removes the capacity for the amino acids to take zwitterionic forms. For capped alanine, eleven distinct energetic minima were found. So that we can compare with the MD trajectory obtained for zwitterionic alanine in the previous section, we have removed the capping groups from the capped alanine minima. The zwitterionic conformers were subsequently optimised at the B3LYP/aug-cc-pVDZ level of theory with implicit solvation so that the zwitterion remains stable. Upon optimisation, we have performed single point frequency calculations on the optimised conformers in vacuo to obtain the gas phase Hessian and Jacobian.

tyche sampling has been undertaken at a temperature of 900 K, to remain consistent with our findings in Section 3.6.1. To allow for transition between the seeding geometries, we have selected a transition temperature of 300 K, since we see no reason why our LW approximation should require a higher temperature. Out of the original eleven minima, five were found to have Boltzmann weights exceeding 10⁻³. The remaining six minima were discarded from our sampling as seeding geometries.
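The selection of seeding geometries by Boltzmann weight can be sketched as follows. The sketch assumes the weight is taken relative to the lowest-energy minimum, exp(−ΔE / k_B T), which may differ from the normalisation used in tyche; the function name `select_seeds` and the energies are hypothetical.

```python
import math

K_B = 0.0083144626  # Boltzmann constant in kJ mol^-1 K^-1


def select_seeds(energies, temperature=300.0, threshold=1e-3):
    """Indices of minima whose weight relative to the lowest minimum exceeds threshold.

    energies: energies of the minima in kJ/mol (any common zero).
    """
    e_min = min(energies)
    weights = [math.exp(-(e - e_min) / (K_B * temperature)) for e in energies]
    return [i for i, w in enumerate(weights) if w > threshold]


# Hypothetical relative energies (kJ/mol) of four minima.
kept = select_seeds([0.0, 5.0, 16.0, 20.0])
print(kept)
```

With this definition, a threshold of 10⁻³ at 300 K corresponds to an energy cut-off of k_B T ln(1000), roughly 17 kJ/mol above the lowest minimum, so raising the transition temperature admits higher-lying minima as seeds.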

In Tables 3.12 and 3.13, we present the sampling ranges of the bond stretching and valence angle bending degrees of freedom used in Section 3.6.1 as obtained from a single seeding geometry (tyche (Single)), from the five seeding geometries found to have a Boltzmann weight greater than 10⁻³ (tyche (Multi)), and from the MD trajectory of zwitterionic alanine. We make no reference to the restricted sampling ranges invoked in Section 3.6.1 since the multiple seeding geometries grant the capacity to undertake a full conformational sampling.

Sampling Range (Å)

| | N1-H2 | C1-C2 | N1-C1 | C3-O1 |
|---|---|---|---|---|
| tyche (Single) | [1.13 − 0.95] | [1.64 − 1.45] | [1.60 − 1.44] | [1.31 − 1.22] |
| tyche (Multi) | [1.19 − 0.91] | [1.63 − 1.45] | [1.59 − 1.37] | [1.32 − 1.19] |
| MD | [1.15 − 0.92] | [1.65 − 1.43] | [1.63 − 1.35] | [1.35 − 1.19] |

Table 3.12: Sampling ranges of four prominent bond stretching degrees of freedom in zwitterionic alanine, as obtained through tyche with a single seeding geometry (tyche (Single)), with five seeding geometries (tyche (Multi)) and through a MD trajectory.

Sampling Range (degrees)

| | O1-C3-O2 | H2-N1-H3 | N1-C1-C3 | H5-C2-H6 |
|---|---|---|---|---|
| tyche (Single) | [131.5 − 124.8] | [120.2 − 99.9] | [115.8 − 107.7] | [116.8 − 101.1] |
| tyche (Multi) | [133.2 − 117.4] | [126.0 − 91.0] | [114.4 − 97.7] | [123.4 − 91.8] |
| MD | [134.3 − 115.8] | [128.3 − 90.2] | [117.6 − 97.3] | [126.2 − 92.8] |

Table 3.13: Sampling ranges of four prominent valence angle bending degrees of freedom in zwitterionic alanine, as obtained through tyche with a single seeding geometry (tyche (Single)), with five seeding geometries (tyche (Multi)) and through a MD trajectory.

From both Tables 3.12 and 3.13, we see that the inclusion of more than a single seeding geometry results in an excellent agreement of sampling ranges with those of the MD trajectory. In contrast, with a single seeding geometry, we found in Section 3.6.1 that the lower bounds of the sampling ranges were poorly reproduced, and the upper bounds of the sampling ranges were only found to be in agreement with the MD upper bound at very high temperatures.

We omitted reporting the sampling of the dihedral degrees of freedom in Section 3.6.1 since we anticipated a single seeding geometry to be inadequate for sampling these degrees of freedom. Typically, the dihedral degrees of freedom adopt a number of rotamer conformations that are similar in energy but distinct in terms of conformation. A single LW is incapable of capturing such broad sampling ranges.

In the following, we analyse the sampling of the H1-N1-C1-C2 dihedral degree of freedom of zwitterionic alanine. From our analysis, we have found this to be the most widely varying dihedral angle over the course of the MD simulation, and so it is ideal for our purposes. The tetrahedral amino group is able to freely rotate in the zwitterion, and so we expect three distinct maxima in the sampling frequency at −120°, 0° and 120°. Conformational studies of peptides are conventionally validated with Ramachandran plots, which report on the peptide backbone dihedral angles. A Ramachandran plot is, however, of little use when working with a single amino acid zwitterion since the terminal groups can freely rotate, and so the majority of the Ramachandran plot is typically sampled.
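For reference, a dihedral angle such as H1-N1-C1-C2 can be evaluated from four Cartesian positions with the standard atan2 construction. The sketch below is a generic implementation, not the routine used in tyche or hermes; the helper names and the test coordinates are hypothetical.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle p0-p1-p2-p3 in degrees, in the interval (-180, 180]."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)  # normals to the two planes
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


# Hypothetical coordinates: an eclipsed (cis) arrangement and a quarter turn.
cis = dihedral((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (1.0, 0.0, 1.0))
quarter = dihedral((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 1.0, 1.0))
print(cis, quarter)
```

Using atan2 rather than an inverse cosine retains the sign of the angle, which is needed to distinguish the rotamers at −120° and 120°.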

In Figure 3.12, we present histograms showing the sampling frequency of the possible H1-N1-C1-C2 dihedral angles of zwitterionic alanine. The red and blue traces correspond to the sampling obtained through tyche and the MD trajectory, respectively. The MD trajectory appears to recover the three rotamers, albeit the peaks are extremely broad, particularly the one centred at 0°. Indeed, virtually the entire range of values the dihedral degree of freedom can take is sampled, which implies that a small number of seeding geometries will be unable to recover the sampling obtained through MD. When seeding with the gas-phase minima, the three rotamers are sampled, but there is very little sampling obtained about the seeding geometries, roughly a range of 40° about each rotamer. It then seems as if seeding with local minima is inadequate for an extensive sampling of conformational space.

Figure 3.12: Histograms for the sampling of the alanine sidechain dihedral angle by use of the various tyche sampling schemes (by seeding with energetic minima (red), MD snapshot conformations (green) and the selective mode distortion (yellow)), and the MD sampling (blue).

[Figure 3.12 plot: Sampling Frequency against Dihedral Angle (degrees), in 20° bins from −180° to 180°; series: Tyche (Minima Seeding), Tyche (Selective), Tyche (MD Seeding), MD Trajectory.]

We can artificially extend the sampling range with the same seeding geometries by increasing the sampling temperature. However, since we have already found an optimal temperature with which to sample the bond stretching and valence angle bending degrees of freedom, increasing the temperature is not a feasible approach. Increasing the temperature would lead to an oversampling of the bond stretching and valence angle bending degrees of freedom, and likely drive the system into undesirable regions of conformational space by, for example, breaking bonds.

We have instead proposed a selective sampling scheme that prevents the oversampling of the bond stretching and valence angle bending degrees of freedom. By decomposing each normal mode into its constituent redundant internal coordinates, we can vibrate those modes that are dominated by dihedral degrees of freedom at a higher temperature than that used for vibrating the modes dominated by bond stretching and valence angle bending. For our experiment, we have chosen to vibrate the ten modes with the highest dihedral contributions at a temperature of 1800 K, while maintaining a temperature of 900 K for the remaining normal modes. The sampling from this scheme is depicted by the yellow histogram in Figure 3.12.
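The assignment of per-mode temperatures in this selective scheme can be sketched as below, assuming each normal mode has already been given a scalar dihedral contribution from its decomposition into redundant internal coordinates. The function name `assign_mode_temperatures` and the example contributions are hypothetical; the defaults mirror the ten modes, 1800 K and 900 K quoted above.

```python
def assign_mode_temperatures(dihedral_fraction, n_hot=10, t_hot=1800.0, t_base=900.0):
    """Per-mode temperatures: the n_hot most dihedral-dominated modes get t_hot.

    dihedral_fraction: one number per normal mode giving its dihedral contribution.
    """
    n_modes = len(dihedral_fraction)
    # Mode indices ranked from the largest to the smallest dihedral contribution.
    ranked = sorted(range(n_modes), key=lambda i: dihedral_fraction[i], reverse=True)
    hot = set(ranked[:n_hot])
    return [t_hot if i in hot else t_base for i in range(n_modes)]


# Hypothetical dihedral contributions for five modes; heat the top two.
temps = assign_mode_temperatures([0.9, 0.1, 0.6, 0.05, 0.3], n_hot=2)
print(temps)
```

Selecting a fixed number of modes, rather than thresholding on the contribution, keeps the number of high temperature modes independent of how the contributions happen to be normalised.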

We see that the selective high temperature sampling yields a broader sampling about the rotamer seeds than the single temperature. Indeed, the range of sampling about the rotamer at 0° is increased by roughly 40° relative to the single temperature sampling. However, we are unable to sample continuously across the dihedral range as accomplished by the MD trajectory. Nonetheless, the selective high temperature sampling is certainly a useful tool if we are restricted in the number of seeding geometries available. It is encouraging that we can increase the sampling range so much from a minor adjustment to tyche.

The final strategy involves the use of more seeding geometries than can be offered by the local minima of the system. The power of the tyche methodology is that the seeding geometries can be obtained by any means, whether it be systematic sampling, MD, or even random conformer generation. We have selected 100 random conformers from the MD trajectory against which we are benchmarking. The Jacobian and Hessian from each of these conformers have been evaluated at the B3LYP/aug-cc-pVDZ level of theory, and those conformers with a Boltzmann weight lower than 10⁻³ were discarded. This left twenty seeding geometries with which to run tyche. Naturally, since these conformers are not stationary points on the PES, we require the non-stationary point treatment that has been outlined in Section 3.3.

The sampling obtained from this MD-seeding approach is depicted by the green histogram in Figure 3.12. Relative to both the minima-seeding and selective high temperature approaches, we recover a greater range of dihedral sampling with the MD-seeding. However, there are a number of discrepancies between the MD trajectory and MD-seeding sampling regions. For example, the rotamer at 0° is oversampled relative to the MD trajectory sampling. We could perhaps improve the sampling by increasing the transition temperature, subsequently including more seeding geometries into our sampling and covering a broader range of the dihedral angle. In addition, we could invoke the selective high temperature strategy, but these strategies are not investigated further here.

3.6.3 Conformational Sampling of Large Molecular Species

As a final study, we demonstrate the capacity of the tyche methodology to be used for large biomolecular systems. The dominant conformational degrees of freedom that govern the motion of large biomolecules have been shown to be contained within the low frequency normal modes[278]. These low frequency normal modes have been strongly linked to enzymatic mechanisms in proteins, such as bovine pancreatic trypsin inhibitor[232], the gramicidin-A dimer channel[279], lysozyme[280], F1-ATPase[281] and ras p21[282]. Indeed, vibrations have been proposed to be driving forces for coupling energy release to conformational changes. The energy is thought to be transferred by means of a soliton along α-helical motifs[283, 284]. The soliton is thought to propagate along the chains formed along the helix axis originating from peptide backbone interactions. Excitons have also been proposed to be responsible for the coupling of photoexcitation to conformational changes[285].

Figure 3.13: Helix stretching and bending modes of motion for the alanine-isoleucine helix.

A simple polypeptide system that exhibits large scale motion is the α-helix motif. The α-helix is a spiral conformation of a peptide chain, in which the carbonyl and amide groups from the peptide backbone interact along the helix axis and stabilise the conformation. The α-helix is one of the more stable secondary structure motifs, and is a highly important feature for a number of protein functionalities[286, 287, 288]. Two large-scale conformational motions have been identified in the past: the helix “breathing” mode[289], in which there is a stretching along the helix axis, and the helix “bending” mode[290], where the helix bends about its midpoint. Our test system will be a decapeptide, in which alanine residues sandwich an isoleucine at the centre of the helix. This system is depicted in Figure 3.13.

To track the helix breathing, we use the distance between the Cα atoms at the N- and C-termini of the helix. For the helix bending, we use the angle subtended at the Cα of the isoleucine residue by the two terminal Cα atoms that define the helix breathing degree of freedom. We present the evolution of these degrees of freedom in Figures 3.14 and 3.15, where the twenty modes with the highest bond stretch, valence angle or dihedral angle contributions are evolved.
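The two tracking coordinates reduce to a distance and an angle between three Cα positions. The following sketch shows one way they might be evaluated for a single conformer; the function names and the reading of the bend as the angle at the central Cα are assumptions made for illustration.

```python
import numpy as np


def helix_stretch(ca_n_term, ca_c_term):
    """Distance between the N- and C-terminal C-alpha atoms."""
    diff = np.asarray(ca_c_term, dtype=float) - np.asarray(ca_n_term, dtype=float)
    return float(np.linalg.norm(diff))


def helix_bend(ca_n_term, ca_centre, ca_c_term):
    """Angle in degrees at the central C-alpha between the two terminal C-alphas."""
    u = np.asarray(ca_n_term, dtype=float) - np.asarray(ca_centre, dtype=float)
    v = np.asarray(ca_c_term, dtype=float) - np.asarray(ca_centre, dtype=float)
    cos_angle = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    # Clip guards against round-off pushing the cosine just outside [-1, 1].
    return float(np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0))))
```

Applying both functions to every conformer in a sample set gives the traces of the kind plotted against conformer index.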

Figure 3.14: Helix breathing motion recovered from selectively evolving those modes comprising dihedral (red), valence angle (green) and bond stretching (blue) degrees of freedom. [Plot: helix stretch (Angstroms, axis range approximately 17.55 to 17.64) against conformer index (0 to 300); series: Bonds, Angles, Dihedrals.]

We see that for both the helix breathing and bending, sampling those modes that are dominated by dihedral degrees of freedom yields a far greater range of conformers than sampling those modes that are dominated by valence angle and bond stretch degrees of freedom. This is to be expected, since the dihedral-dominated modes are typically far lower in frequency than the other modes. Our result is then in keeping with studies that have found large scale conformational changes to be dominated by low frequency modes of motion, as we have mentioned. We do note that the range of sampling is not extensive. However, this results from the harmonic approximation we have employed in expanding about the seeding conformation; the walls of the well become steep rather quickly, and so vibrations about a given point are limited in the amount of conformational space available. One can in principle undertake a much greater conformational search by using a number of additional seeding geometries.

Figure 3.15: Helix bending motion recovered from selectively evolving those modes comprising dihedral (red), valence angle (green) and bond stretching (blue) degrees of freedom. [Plot: helix bend (degrees, axis range approximately 136.9 to 137.7) against conformer index (0 to 300); series: Bonds, Angles, Dihedrals.]

3.7 Conclusion

We have presented a conformational sampling methodology that is both computationally tractable and, in principle, more representative of the sampling one may obtain from the ab initio PES than one constructed by a classical force field.

The tyche methodology is ideal for the construction of conformers with which a machine learning model can be constructed. It naturally implements a form of importance sampling by transitioning between seeding geometries based on Boltzmann weights. It also provides large-scale conformational sampling in addition to finer sampling of the higher frequency degrees of freedom. These properties make it an ideal conformational sampling tool for the construction of machine learning models.

Regarding future work, we suggest that the RIC selective sampling requires further development. For instance, one could define separate temperatures for various subsets of modes dominated by certain valence coordinate contributions, e.g. a high temperature for those modes that are dominated by dihedral degrees of freedom, and a lower temperature for those modes that are dominated by bond stretching degrees of freedom.

It may perhaps be feasible to approximately update the Hessian of the system with respect to a distortion. Hessian updating methods are frequently invoked during ab initio geometry optimisations[291], and for small displacements are relatively accurate. By use of these Hessian-updating methodologies, one could re-evaluate the normal modes of the system once it has been displaced from the seeding geometry, and subsequently vibrate along these new normal modes. However, it is anticipated that such a methodology would significantly increase the computational power required for sampling.
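As an illustration of what such an update looks like, the sketch below implements the BFGS formula, one widely used Hessian update in geometry optimisation. Whether BFGS is the scheme best suited to this application is an assumption; the function and argument names are hypothetical, and the update presupposes that the gradient is available at the displaced geometry.

```python
import numpy as np


def bfgs_hessian_update(hessian, step, grad_change):
    """Return the BFGS-updated approximate Hessian.

    step        : s = x_new - x_old, the displacement from the seeding geometry
    grad_change : y = g_new - g_old, the corresponding change in the gradient
    """
    H = np.asarray(hessian, dtype=float)
    s = np.asarray(step, dtype=float)
    y = np.asarray(grad_change, dtype=float)
    Hs = H @ s
    # Rank-two correction; the updated matrix satisfies the secant condition H_new @ s == y.
    return H + np.outer(y, y) / (y @ s) - np.outer(Hs, Hs) / (s @ Hs)
```

The updated matrix could then be mass-weighted and diagonalised to give approximate normal modes at the displaced geometry, at the cost of one extra gradient evaluation per update.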

Of course, the sample set as output from tyche is not ideal for the immediate construction of a machine learning model. One must tailor the sampling to optimise both the predictive capacity and computational tractability of a resultant kriging model. This requires some form of processing of the sample set produced by tyche, which we detail in Chapter 5.

Chapter 4

A Novel Carbohydrate Force Field

4.1 Introduction

The computational analysis of biochemical systems is largely biased towards peptides and proteins. One may therefore be forgiven for assuming that they possess a near monopoly in biochemistry. However, it is only by complexation with additional molecular species that peptides and proteins are able to accomplish their myriad roles within biological systems[292]. For example, eukaryotic proteins are subject to post-translational modification, a process in which various carbohydrate sequences are attached to the protein. Vital chemical entities, such as enzymatic cofactors (e.g. ATP, NADP, etc.) and nucleotides, are entirely dependent upon the existence of products from the pentose phosphate pathway, which is synthesised from carbohydrates. In fact, the phosphate pathway would be unable to run at all if it were not for the energy derived from carbohydrates, which undergo glycolysis and are subsequently passed into the tricarboxylic acid cycle.

Many biochemical force fields are parameterised by exhaustively sampling quantities arising from peptide atom types. One cannot simply use protein atom types as a direct substitution for carbohydrate atom types for a number of reasons:

1. Peptides possess features that are generally absent in carbohydrates, most prominently the presence of nitrogen and the ability to form structural motifs. Both of these features are particularly perturbative to electrostatic quantities associated with constituent atoms. For example, nitrogen possesses a significant quadrupole moment, which influences other atomic electrostatic quantities anisotropically, and cannot be captured by the standard point charge approximation. Equivalently, structural motifs such as helices and sheets are stabilised by vast intermolecular bonding networks. This dependence on structural motifs necessarily influences the properties of other atoms by constraining them to states that do not necessarily coincide with those of the unfolded non-native state. Although carbohydrates do form structural motifs, they tend to remain flexible under standard biological conditions, and do not typically assemble into the stable secondary structures that polypeptides do.

2. Carbohydrates exhibit far more conformational freedom than peptides. Electronic quantities vary as a function of the conformational degrees of freedom of a molecular species[109]. As such, the electronic quantities of a more flexible conformation will vary to a greater extent than those of a less flexible one.

3. Many carbohydrate species exhibit a preference for axial rather than equatorial arrangements of electron-rich substituents on an anomeric carbon. The origin of this anomeric effect is not entirely clear[293], but the energetics that arise from it must be captured by a force field which deals with carbohydrates. Similarly, the exo-anomeric effect, which deals with substituents linked to an anomeric oxygen, forces a separate conformational preference. It is not within the scope of this work to deal with the anomeric and exo-anomeric effects in any great detail, and so we refer the interested reader to a more rigorous overview[294].

4. Similar to the anomeric effect, the gauche effect can arise within a number of carbohydrate species and bias rotamer preferences[295]. This effect has an ambiguous origin. Research has suggested it to arise from hyperconjugation or solvent effects. In short, it is the preference of a gauche rotamer over an anti rotamer, where the latter would be stereoelectronically preferable. Work by Kirschner and Woods[296] proposed that the gauche effect results from solvent effects, which are not important in our work since we are explicitly dealing with gas phase molecules. However, for a carbohydrate force field to be of use, this effect must be accounted for.

The conformational freedom of carbohydrates renders them somewhat troublesome for experimentalists, as they prove to be highly difficult to characterise by conventional high-resolution structural determination techniques[297], particularly X-ray crystallography. To be precise, carbohydrates tend to be difficult to crystallise, which is problematic because X-ray diffraction techniques are a valuable source of structural information. As such, the structural characterisation of carbohydrates rests with a few experimental techniques, and subsequent validation by computational means. This required harmony between experiment and computation is vastly important, and has been recently explored[298, 299], and so proves to be a fruitful avenue for development.

Classical force fields such as OPLS-AA, CHARMm, GROMOS and AMBER appear to have characterised carbohydrates as ‘secondary molecular species’ relative to their peptide counterparts. As such, these parameterisations resemble “bolt-on” components. However, force fields that are specifically tailored for carbohydrates do exist and have proven successful. GLYCAM[300] is perhaps the most prominent of these force fields, and has been ported to AMBER. More recently, the advent of dl_field has facilitated the use of GLYCAM parameters within dl_poly 4.0[301]. GLYCAM has undergone extensive validation in an attempt to demonstrate its efficacy. Several studies have focused upon its ability to reproduce conformer populations in explicitly solvated molecular dynamics simulations[302, 303], which is obviously important owing to the massive conformational freedom of carbohydrates. The applicability of GLYCAM to larger, more biologically relevant structures, such as the binding of endotoxin to recognition proteins[304] or the dynamics of lipid bilayers[305], has also been demonstrated.

GLYCAM has attempted to break the paradigm of deriving partial charges based on a single molecular configuration. Instead, it has been developed such that the partial charges are averaged over the course of a molecular dynamics simulation, thus (albeit simplistically) accounting for the dynamic nature of electronic properties[306]. However, it must be emphasised that GLYCAM resides within the partial charge approximation to electrostatics, thus severely limiting its predictive capacity. Sugars are particularly amenable to hydration, yet a partial charge approximation to electrostatics cannot recover the directional preferences of hydrogen bond formation without the addition of extra point charges at non-nuclear positions. The isotropic nature of partial charge electrostatics is readily overcome by use of a multipole moment description of electrostatics, which naturally describes anisotropic electronic features such as lone pairs. The benefits of such a multipole moment description over its partial charge equivalent have been systematically demonstrated over the past 20 years in many dozens of papers, recently reviewed[307]. These benefits are not necessarily outweighed by computational cost: it is a common misconception that multipole moment implementations are inherently expensive relative to their point charge counterparts. The long-range nature of point charge electrostatics, $O(r^{-1})$, relative to higher order multipole moments (dipole-dipole interactions, for example, die off as $O(r^{-3})$), means this is not strictly true. Point charges require a larger interaction cutoff radius relative to higher order multipole moments, and therefore form the bottleneck in electrostatic energy evaluation. Given proper handling (e.g. parallel implementation), the computational overheads associated with multipole moment electrostatics can be managed.

In the remainder of this chapter, we shall demonstrate a novel means for modelling electrostatics by use of a multipole moment expansion centred upon each atomic nucleus. The techniques we present will inherently capture the conformational dependence of these multipole moments.

4.2 A Basis Set for Machine Learning

A great deal of work has been conducted within our group on the use of an Atomic Local Frame (ALF) as a basis set for denoting molecular conformations. This basis set is ideal since it spans only the internal degrees of freedom of the molecule, making it superior to a Cartesian basis set. In constructing the ALF, one requires the definition of a local frame within which one can express the positions of all atoms within a molecular system.

We will use $R$ to denote atomic position vectors in some global coordinate frame. We begin by nominating an origin atom, $\Omega_o$, at position $R^{\Omega_o}$. Then, the bond vector which links the origin atom to its heaviest neighbour $\Omega_x$, $R^{\Omega_x} - R^{\Omega_o}$, acts as the x-axis of the local frame. Finally, the xy-plane in the local frame is selected to contain both the first and second ($\Omega_{xy}$) heaviest neighbours of the origin atom. When the origin atom is not of the required valency to possess two neighbours, the Cahn-Ingold-Prelog rules are invoked to select $\Omega_{xy}$. The z-axis is then specified by its orthogonality to the xy-plane.

With these quantities, we construct a rotation matrix, $C : R^{n} \to r^{n}_{\zeta}$, where $R^{n}$ is an atomic Cartesian position vector in the (arbitrary) global frame of reference, and $r^{n}_{\zeta}$ is the corresponding atomic Cartesian position vector in the ALF. The components of $C$ are constructed by standard geometric functions, requiring three quantities

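\[ r_{\Omega_x} = \lVert R^{\Omega_x} - R^{\Omega_o} \rVert \tag{4.2.1} \]

\[ r_{\Omega_{xy}} = \lVert R^{\Omega_{xy}} - R^{\Omega_o} \rVert \tag{4.2.2} \]

\[ \chi_{\Omega} = \arccos\left( \frac{(R^{\Omega_x} - R^{\Omega_o}) \cdot (R^{\Omega_{xy}} - R^{\Omega_o})}{r_{\Omega_x}\, r_{\Omega_{xy}}} \right) \tag{4.2.3} \]

where $r_{\Omega_x}$ is the distance from $\Omega_o$ to $\Omega_x$, $r_{\Omega_{xy}}$ is the distance from $\Omega_o$ to $\Omega_{xy}$, and $\chi_{\Omega}$ is the angle between the two vectors $(R^{\Omega_x} - R^{\Omega_o})$ and $(R^{\Omega_{xy}} - R^{\Omega_o})$. Therefore, three degrees of freedom are required to construct a local axis system centred on $\Omega_o$, the ALF.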
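A minimal sketch of the frame construction is given below. The text specifies only that the x-axis lies along the first bond vector, that the xy-plane contains the second, and that z is orthogonal to that plane; the Gram-Schmidt orthogonalisation, the right-handed choice of z, and all function names are assumptions introduced for illustration. The angle is evaluated from the two bond vectors measured from the origin atom, as in the description of Equation 4.2.3.

```python
import numpy as np


def alf_rotation_matrix(r_origin, r_x, r_xy):
    """Rotation matrix whose rows are the ALF unit vectors in the global frame."""
    r_origin = np.asarray(r_origin, dtype=float)
    v_x = np.asarray(r_x, dtype=float) - r_origin
    v_xy = np.asarray(r_xy, dtype=float) - r_origin
    e_x = v_x / np.linalg.norm(v_x)
    # Remove the x-component of the second bond vector so that e_y lies in the
    # plane containing both neighbours (fails if the three atoms are collinear).
    v_y = v_xy - np.dot(v_xy, e_x) * e_x
    e_y = v_y / np.linalg.norm(v_y)
    e_z = np.cross(e_x, e_y)  # right-handed frame (assumed convention)
    return np.vstack([e_x, e_y, e_z])


def alf_defining_coordinates(r_origin, r_x, r_xy):
    """Return the two distances and the angle that define the frame."""
    r_origin = np.asarray(r_origin, dtype=float)
    v_x = np.asarray(r_x, dtype=float) - r_origin
    v_xy = np.asarray(r_xy, dtype=float) - r_origin
    r_omega_x = np.linalg.norm(v_x)
    r_omega_xy = np.linalg.norm(v_xy)
    cos_chi = np.dot(v_x, v_xy) / (r_omega_x * r_omega_xy)
    chi = np.arccos(np.clip(cos_chi, -1.0, 1.0))
    return float(r_omega_x), float(r_omega_xy), float(chi)


def to_alf(position, r_origin, rotation):
    """Express a global-frame position in the ALF centred on the origin atom."""
    shifted = np.asarray(position, dtype=float) - np.asarray(r_origin, dtype=float)
    return rotation @ shifted
```

Because the rows of the matrix are orthonormal, its transpose maps ALF vectors back to the global frame.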

Figure 4.1: The Atomic Local Frame for the carbon atom of methanol. From the global coordinate system, $x, y, z$, one is able to define an ALF, the basis vectors of which are denoted $x_{ALF}, y_{ALF}, z_{ALF}$, from the vectors $R^{\Omega_o}, R^{\Omega_x}, R^{\Omega_{xy}}$. Any atom not defining the ALF, e.g. $R^{n}$, is defined in the ALF with the spherical polar coordinates $r_n, \theta_n, \phi_n$.

For all other atoms in the system, $n \notin \{\Omega_o, \Omega_x, \Omega_{xy}\}$, positions are given by the triplet of spherical polar coordinates, $\{r_n, \theta_n, \phi_n\}$, i.e. the radial distance, colatitudinal and azimuthal angles, respectively. Then, we see that $3N - 6$ degrees of freedom are required to describe a molecule in the ALF. We note that we could have simply used Cartesian position vectors for all such atoms not defining the ALF and obtain a basis of the same dimensionality. However, we find the spherical polar coordinates more physically intuitive. Of course, in doing so, our basis vectors do not have the same units (the radial distance has units of distance while the colatitudinal and azimuthal angles have units of radians, and are periodic), but we have found that this does not affect our methodology.
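The conversion to spherical polar coordinates and the resulting feature count can be sketched as follows. The input is assumed to be a position already expressed in the ALF; the function names and the choice of angle ranges (colatitude in [0, π], azimuth in (−π, π]) are assumptions made for this illustration.

```python
import numpy as np


def to_spherical(position_alf):
    """Radial distance, colatitude and azimuth of a position given in the ALF."""
    x, y, z = np.asarray(position_alf, dtype=float)
    r = np.sqrt(x * x + y * y + z * z)
    theta = np.arccos(z / r)  # colatitude measured from the ALF z-axis
    phi = np.arctan2(y, x)    # azimuth measured from the ALF x-axis
    return float(r), float(theta), float(phi)


def alf_feature_count(n_atoms):
    """Three frame-defining coordinates plus three per remaining atom (N >= 3)."""
    return 3 + 3 * (n_atoms - 3)
```

The count 3 + 3(N − 3) simplifies to 3N − 6, matching the number of internal degrees of freedom of a non-linear molecule.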

4.3 Computational Details

The workflow proposed below essentially takes an ensemble of configurations as input, and outputs kriging models for the variation in the atomic multipole moments as a function of the configuration of the system:

1. The test system is sampled in accordance with the methods outlined in Section 3. The general idea is to sample as much of conformational space as would form an ensemble for the true physical system over the course of a dynamical trajectory. This subsequently allows for the formation of a kriging model that will be used in a purely interpolative fashion.

2. Single-point calculations are performed on each sample and the resultant ab initio electron density is partitioned by QCT software. The multipole moments of the topological atoms are subsequently obtained. This allows a kriging model to evaluate a functional form corresponding to the evolution of the various multipole moments as a function of conformation.

3. The sample set is split into a non-overlapping training set and test set. The training set is utilised for training of our kriging models, i.e. these are the points that the kriging function must pass through. The test set is not trained for, but is used after the construction of the kriging models to evaluate the errors associated with their predictions.

4. A kriging model is built for each multipole moment of each atom, which allows for the generation of a smooth interpolative function, mapping the evolution of the multipole moment against the molecular conformational degrees of freedom. By use of particle swarm optimisation, we optimise our kriging parameters, to obtain an optimal kriging model.

5. The kriging models are assessed by making them predict the multipole moments for each atom in a system whose configuration has not been used for training of the kriging model. However, we do possess the ab initio multipole moments for this configuration. As such, we evaluate the energy associated with all 1-5 (i.e. two nuclei separated by four bonds) and higher (1-n, n ≥ 5) order interatomic electrostatic interactions as given by the predicted multipole moments from the kriging models, and the equivalent energy as given by the (exact or original) ab initio atomic multipole moments. We subsequently assess the deviation of the kriging predictions from the ab initio electrostatic energies.

The choice of 1-n (n ≥ 5) interactions over the conventional 1-n (n ≥ 4) is discussed in Section 2.5.2, and is used to prevent a divergence in the multipole moment interaction energy.
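The set of 1-n (n ≥ 5) atom pairs in step 5 follows from the bond graph alone: a pair qualifies when the shortest bonded path between the two atoms contains at least four bonds. The sketch below enumerates such pairs by breadth-first search; the function names and the representation of the molecule as a list of bonded index pairs are assumptions for illustration.

```python
from collections import deque


def bond_separations(n_atoms, bonds, source):
    """Number of bonds on the shortest path from `source` to every reachable atom."""
    neighbours = {i: [] for i in range(n_atoms)}
    for a, b in bonds:
        neighbours[a].append(b)
        neighbours[b].append(a)
    separation = {source: 0}
    queue = deque([source])
    while queue:
        atom = queue.popleft()
        for nxt in neighbours[atom]:
            if nxt not in separation:
                separation[nxt] = separation[atom] + 1
                queue.append(nxt)
    return separation


def one_n_pairs(n_atoms, bonds, min_bonds=4):
    """Intramolecular atom pairs separated by at least `min_bonds` bonds (1-5 and higher)."""
    pairs = []
    for i in range(n_atoms):
        separation = bond_separations(n_atoms, bonds, i)
        for j in range(i + 1, n_atoms):
            # Atoms unreachable from i (a different molecule) are skipped here.
            if j in separation and separation[j] >= min_bonds:
                pairs.append((i, j))
    return pairs
```

Ring closure shortens many paths, so the same number of atoms arranged in a ring and in a chain generally gives different pair counts.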

Note that the electrostatic energy[213] is the final arbiter in the validation of the kriging models, rather than the atomic multipole moments themselves, which are the kriging observations. It is worth recalling that every prediction point from a kriging model possesses an associated variance, i.e. uncertainty in the predicted value. We have in the past investigated the magnitude of these variances, but they are orders of magnitude smaller than the actual predicted value, and so inconsequential. We thus proceed under the assumption that uncertainties in predicted values are negligible. The molecular electrostatic energy is calculated by a well-known multipolar expansion[77] involving a multitude of high-rank atomic multipole moments[208]. This expansion is truncated to quadrupole-quadrupole (L = 5) and rank-equivalent combinations (dipole-octopole and monopole-hexadecapole).
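The truncation can be made concrete by enumerating the pairs of multipole ranks that it retains, taking the interaction rank as L = l_A + l_B + 1 so that quadrupole-quadrupole corresponds to L = 5. The helper names below are hypothetical; the second function counts the spherical multipole components per atom up to a given rank, which is the number of kriging models each atom would need.

```python
def rank_pairs(max_interaction_rank=5, max_rank=4):
    """All (l_A, l_B) with l_A + l_B + 1 <= max_interaction_rank."""
    return [
        (l_a, l_b)
        for l_a in range(max_rank + 1)
        for l_b in range(max_rank + 1)
        if l_a + l_b + 1 <= max_interaction_rank
    ]


def components_up_to(max_rank=4):
    """Number of spherical multipole components, (2l + 1) per rank l."""
    return sum(2 * l + 1 for l in range(max_rank + 1))


NAMES = ["monopole", "dipole", "quadrupole", "octopole", "hexadecapole"]

# Pairs at exactly L = 5, the highest interaction rank retained.
highest = [(NAMES[a], NAMES[b]) for a, b in rank_pairs() if a + b + 1 == 5]
```

Each retained term decays with distance as r^{-L}, which is consistent with the r^{-3} behaviour of the dipole-dipole interaction (l_A = l_B = 1).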

Whilst somewhat indirect, the validation through energy rather than multipole moment has a twofold purpose. Primarily, the energy is the quantity that will be used for dynamical simulations, and so is the ultimate descriptor that we wish to evaluate correctly. Secondly, the alternative would be to assess the predictive capacity of each individual kriging model. For a system with any sizeable number of atoms, where each atom has 25 individual multipole moment kriging models, the data analysis obviously becomes overwhelming. However, this analysis is unnecessary owing to the uniqueness of the Taylor expansion from which the multipole moments arise. Since the electrostatic energy is computed from two such unique series, then if the electrostatic energy is correctly predicted, the multipole moments must also be correct by deduction. Note that this consideration is valid for a single atom-atom interaction.

In order to gauge the models’ validity, we plot a cumulative distribution of errors (CDE). The CDE plots the absolute deviation of the predicted energy from the ab initio energy, computed from the ab initio multipole moments, after having evaluated the multipole moment interactions. Put more precisely, the predicted multipole moments form an energy that is subtracted from an energy obtained from the ab initio moments. Then the absolute value of this difference is taken. Hence compensation of errors is not allowed, because the absolute value of each difference is taken before any summation. All errors are subsequently summed to give an “energetic deviation”. These energetic deviations are plotted against the percentile of test configurations that fall on or below the given energetic deviation. The CDE is sigmoidal (as such it is colloquially referred to as an “S-curve” in our group, but we adopt the more meaningful CDE nomenclature throughout), and therefore represents a Gaussian distribution of errors.
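The construction of a CDE from per-interaction energies can be sketched as below. The array layout (one row per test configuration, one column per atom-atom interaction) and the function name are assumptions for illustration; the ordering of operations follows the description above, with the absolute value taken per interaction before summing.

```python
import numpy as np


def cumulative_distribution_of_errors(e_predicted, e_ab_initio):
    """Sorted energetic deviations and the percentile of configurations at or below each.

    Both inputs have shape (n_configurations, n_interactions).
    """
    e_predicted = np.asarray(e_predicted, dtype=float)
    e_ab_initio = np.asarray(e_ab_initio, dtype=float)
    # Absolute value first, then sum over interactions: errors of opposite
    # sign within one configuration cannot cancel.
    deviations = np.abs(e_predicted - e_ab_initio).sum(axis=1)
    sorted_deviations = np.sort(deviations)
    n = sorted_deviations.size
    percentiles = 100.0 * np.arange(1, n + 1) / n
    return sorted_deviations, percentiles
```

Plotting `percentiles` against `sorted_deviations` on a logarithmic abscissa gives the S-shaped curve, and `sorted_deviations.mean()` gives the mean error of the test set.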

Our aim is then twofold: the first is to reduce the upper tail of the sigmoid such that the 100th percentile error is convergent at as low an error as possible. This corresponds to the predictions being uniformly good across the test set with no spurious predicted interactions. Our second aim is to shift the CDE as far down the abscissa as possible, which ensures the average error associated with our predictions is as low as possible. The first goal is achieved by certifying that the training points used for the construction of the kriging models form the boundaries of configurational space with respect to our sample set. This boundary checking guarantees that the kriging model is being asked to interpolate from training data. Boundary checking is discussed in Section 5.5.

The second goal is attained by consistent improvement of the kriging engine, and making sure that the test points are uniformly close to the training data, allowing for efficient interpolation. Note that we do not necessarily choose test points that are close to the training points. In other words, the training and test sets are constructed independently. By ensuring that the training data is uniformly distributed throughout the sampling domain, the average “distance” between an arbitrary test point and a training point will equal that between some other arbitrary test point and another training point. This guarantees no spurious predictions in undertrained regions of conformational space. Of course, it is prudent to invoke some form of importance sampling, which yields a greater sampling density in more “important” regions of conformational space. We discuss the methodologies relating to training set construction in Section 5.

We work on the tetrose diastereomers erythrose and threose, the smallest carbohydrates that adopt open chain and furanose forms. The particular conformations studied are given in Figure 4.2. Energetic minima were provided by Prof Ibon Alkorta, who had previously conducted a PES scan of these species[308], and are reported more thoroughly elsewhere[309]. All geometries were subsequently optimised by the program gaussian03[310] at the B3LYP/6-311++G(d,p) level of theory. The Hessian of each geometry has been computed, and utilised in the tyche procedure described in Section 3, allowing for the output of 2000 geometries for each system. Single point calculations were performed on each sample at the B3LYP/apc-1[311] level of theory. The apc-1[311] basis set (which is a polarization-consistent (pc) double-ζ plus polarization basis set with diffuse functions) was used for the DFT calculations, since this family of basis sets has been specifically optimized for DFT. The resultant wavefunctions are then passed on to the program aimall[312], which calculates the atomic multipole moments by application of QCT.

We are only interested in the internal degrees of freedom of the molecular configurations. Hence the atomic multipole moments must be expressed in an atomic local frame (ALF) rather than in the global frame. This procedure makes sure that the kriging focuses on the variation of atomic multipole moments within the molecule. Otherwise, when referring to the global frame (rather than the ALF), the three components of an atomic dipole moment, for example, vary upon rigid rotation of the whole molecule. Training for such a variation is useless. The details of the atomic local frame chosen for our work are outlined in Section 4.2.
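The point about rigid rotation can be checked numerically. In the sketch below, a hypothetical three-atom fragment and atomic dipole are rotated together about the global z-axis: the global-frame dipole components change, while the components expressed in the local frame do not. The geometry, the dipole values and the Gram-Schmidt frame construction are illustrative assumptions.

```python
import numpy as np


def alf_axes(r_o, r_x, r_xy):
    """Rows are local-frame unit vectors in the global frame (Gram-Schmidt, assumed)."""
    r_o = np.asarray(r_o, dtype=float)
    e_x = np.asarray(r_x, dtype=float) - r_o
    e_x = e_x / np.linalg.norm(e_x)
    v = np.asarray(r_xy, dtype=float) - r_o
    e_y = v - np.dot(v, e_x) * e_x
    e_y = e_y / np.linalg.norm(e_y)
    return np.vstack([e_x, e_y, np.cross(e_x, e_y)])


def rotate_z(angle):
    """Rotation matrix for a rotation by `angle` radians about the global z-axis."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


# Hypothetical fragment: origin atom and its two frame-defining neighbours.
atoms = np.array([[0.0, 0.0, 0.0], [1.4, 0.0, 0.0], [-0.5, 1.0, 0.2]])
dipole = np.array([0.3, -0.1, 0.2])  # hypothetical atomic dipole, global frame

Q = rotate_z(0.7)
atoms_rotated = atoms @ Q.T   # rigid rotation of every atomic position
dipole_rotated = Q @ dipole   # the dipole rotates with the molecule

dipole_alf = alf_axes(*atoms) @ dipole
dipole_alf_rotated = alf_axes(*atoms_rotated) @ dipole_rotated
```

Here `dipole` and `dipole_rotated` differ, whereas `dipole_alf` and `dipole_alf_rotated` agree to numerical precision, so a model trained on local-frame components sees only internal variation.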

Kriging of each multipole moment (up to the hexadecapole moment) was performed for each atom by the in-house program ferebus 1.4 and models consisting of N training examples were generated. Atom-atom interaction energies of 1-5 and higher were computed in a test set of 200 arbitrary conformations, using the ab initio multipole moments, and compared to the interaction energies from multipole moments predicted by the kriging models. Errors are given in the form of so-called S-curves, which map the percentile of conformations within the test set, predicted up to a maximum error chosen, which is read off on the abscissa.

Figure 4.2: Erythrose and Threose in the open chain and furanose forms.

Here we give a brief overview of how the force field we propose could be utilised within the context of MD. We limit the discussion to the evaluation of electrostatic interactions, but work is currently being undertaken within our group to establish the framework for an entire force field[313], which deviates considerably from the terms arising in a classical force field. The non-electrostatic terms are also obtained via the QCT partitioning of molecular energy, given in Section 2.2.3. At designated points over the course of an MD simulation, the conformational state of the system is evaluated. At this point, atomic multipole moments, up to the hexadecapole moment, can be extracted from the kriging models, and subsequently utilised for the evaluation of interatomic electrostatic interactions. In this way, we capture the conformational dependence of the atomic multipole moments. Separate kriging models are obtained for the non-electrostatic terms, that is, the intra-atomic energy (both kinetic[212] and potential), the short-range interatomic Coulomb energy not obtained by multipolar expansion, and the interatomic exchange energy.

The issue of coordinate frames needs further clarification because the Cartesian MD frame coordinate system is not the same as the local coordinate frame within which we have evaluated the atomic multipole moments. Prior to invoking a kriging model for the evaluation of multipole moments, the conformational state of the system must be converted from a Cartesian frame of reference to an ALF. From here, atomic multipole moments can be evaluated corresponding to the current state of the system. Previous work has derived the forces arising from the interactions of atomic multipole moments within the Cartesian frame of reference[146], which requires the partial derivatives of the multipole moments with respect to the ALF degrees of freedom. These terms have an analytical functional form, and so computation of the forces in the Cartesian frame of reference can be performed explicitly. These forces can subsequently be utilised by the standard MD procedure.

An embryonic workflow has been integrated within the MD package dl_poly 4.0. Currently, an atomic kriging model is loaded into memory and the relevant data stored, followed by removal of the kriging model from memory. Memory requirements are discussed in Appendix B. Of course, memory management is crucial to the speed of the proposed methodology, and so will require a great deal of fine-tuning. However, much speedup can undoubtedly be accomplished by a number of techniques, e.g. caching of regularly used quantities and parallel implementation.

Finally, it is too early to extensively comment on the computational cost of the current approach. It would be naive to directly compare the flop count of the current force field with a traditional one without appreciating (i) that extra non-nuclear point charges are needed to match the accuracy of multipole moments, and that the former propagate over long range, (ii) that multipolar interactions drop off much faster than 1/r, depending on the rank of the interacting multipole moments, which depends on the interacting elements themselves (see extensive testing in the protein crambin[118]), (iii) the efficiency of the multipolar Ewald summation[314] that is being implemented in dl_poly 4.0, (iv) the dominance of monopolar interactions at long range (the vast majority of interactions) and (v) the outstanding fine-tuning of the kriging models at production mode. The current force field may well be an order of magnitude slower than a traditional force field. This estimate and the fact that the current force field contains electronic information invite one to compare its performance with on-the-fly ab initio calculations instead.

4.4 Results and Discussion

4.4.1 Single Minimum

The lowest energy conformer for each system was chosen as an input structure for sampling. CDEs were subsequently generated for each of these training sets by the methodology outlined in the previous section. The CDEs are given in Figure 4.3, and the accompanying mean errors in Table 4.1.

Mean Error (kJ mol$^{-1}$)   Open Chain   α-furanose   β-furanose
Erythrose                    0.27         1.32         1.30
Threose                      0.34         0.83         1.47

Table 4.1: Mean errors associated with the CDEs given in Figure 4.3.

The first point to notice is that the open chains for both erythrose and threose are modelled by kriging to a significantly better standard than the furanose forms. We may, however, immediately attribute this to the number of 1-5 and higher interactions occurring in these systems. For both open chains, 25 interactions are required to be evaluated for comparison to the energies produced from the ab initio multipole moments. The number of interactions requiring evaluation for the furanose forms comes to 39, which is roughly 50% more than the number evaluated in the open chain forms. We would subsequently expect a proportional relationship between the number of interactions required for evaluation and the mean error attributed to the kriging model. Whilst we see this to be roughly true when comparing the errors on the threose open and α-furanose forms, the errors appear disproportionately higher for the other systems.

Figure 4.3: CDEs corresponding to all erythrose (top) and threose (bottom) systems studied. The open chain forms are systematically better predicted than the corresponding furanose forms.

Kriging is an interpolative technique, and so is not suited for extrapolation. However, we point out that the kriging engine is still predictive for extrapolation: in this case, the prediction falls to the mean value of the function. Obviously this is not ideal for highly undulatory functions. However, considering how the atomic multipole moments do not fluctuate over vast ranges, the mean will often represent a respectable prediction to the function value. The kriging model can be refined in an iterative fashion, whereby extrapolation points are added to the training set. This is a commonly used technique in the field of machine learning. Whilst not currently implemented within our methodology, the iterative protocol is a technique which is currently being explored.
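The reversion to the mean under extrapolation can be seen in a toy kriging predictor. The sketch below is not the ferebus implementation: it assumes a single shared correlation parameter `theta`, a Gaussian correlation function, a constant mean estimated by generalised least squares, and a small `nugget` added for numerical stability. All names are hypothetical.

```python
import numpy as np


def fit_kriging(x_train, y_train, theta=1.0, nugget=1e-10):
    """Fit a constant-mean kriging model with Gaussian correlation."""
    x = np.asarray(x_train, dtype=float).reshape(len(x_train), -1)
    y = np.asarray(y_train, dtype=float)
    sq_dist = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=2)
    R = np.exp(-theta * sq_dist) + nugget * np.eye(len(x))
    ones = np.ones(len(x))
    # Generalised least squares estimate of the constant mean.
    mu = ones @ np.linalg.solve(R, y) / (ones @ np.linalg.solve(R, ones))
    weights = np.linalg.solve(R, y - mu)
    return x, mu, weights, theta


def predict(model, x_new):
    """Predict at a single point: mean plus correlation-weighted residuals."""
    x, mu, weights, theta = model
    x_new = np.asarray(x_new, dtype=float).reshape(1, -1)
    r = np.exp(-theta * ((x - x_new) ** 2).sum(axis=1))
    return float(mu + r @ weights)
```

At a training point the correlation vector `r` reproduces a row of the correlation matrix, so the prediction passes through the training value; far from all training points `r` tends to zero and only the mean `mu` survives.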

Regardless of the problem encountered in the above, we see that it is easily remedied by strategic sampling of conformational space. In fact, these problems are ubiquitous to machine learning techniques, and have been encountered in studies which attempt to implement neural networks to predict a PES[9]. In this field, the problems have been solved to some extent by making the prediction engine issue a warning to the user that a point is being predicted which lies outside of the training set. This proves to be advantageous as the user may then recognise that the point should be included within the training set as it obviously lies within an accessible portion of conformational space. The point can then be included within the training set, and one can generate refined models by undergoing this process iteratively.

4.4.2 Training Set Size Dependency

We start by discussing the effects of increasing the training set size on the prediction error for a kriging model. For this purpose we use the erythrose open chain system owing to its higher conformational flexibility, which we assume amplifies the effects of training set size. Kriging models were generated for this system with training sets ranging from 700 to 1500 sampling points, in increments of 100. The same test set (of 200 points) was reserved for prediction by all models. The CDEs for this are given in Figure 4.4.

Figure 4.4: CDEs for erythrose open chain at various training set sizes. Note the progression of the CDEs towards the lower prediction errors as the training set size increases. However, owing to the logarithmic abscissa, this does not correspond to a uniform enhancement of a kriging model given a consistently larger training set size.

As expected, the prediction errors of the CDEs in Figure 4.4 systematically decrease as the training set size is increased. In other words, the CDEs move to the left with increasing training set size, although this is not true for all parts of the CDEs because they clearly intersect in many places. Overall the uniform increments of 100 in training set size are not matched by equal uniform strides of improvement in CDE shape and position. An alternative way to gauge the improvement in prediction with increasing training set size is monitoring the average prediction error for each CDE. This value cannot be read off for a CDE in Figure 4.4 but can be easily calculated.

Figure 4.5 plots the average prediction error for each CDE against increasing training set size: red for “Old ferebus” and blue for “New ferebus”, a development version of our kriging engine, which differs in a number of ways from the “Old ferebus”. We include both “Old ferebus” and “New ferebus” data to establish whether any functional forms of average prediction error against increasing training set size are conserved with respect to improvements in the engine. The “Old ferebus” data show a plateau in the average error (left pane) at a training set size of about 1200, after an initial decrease in this error. This plateau would be rather problematic, as it implies some maximum efficiency of the kriging engine, beyond which there is no reward for an extension of the training set. However, this is not the case for the “New ferebus” data (right pane).

Learning theory states that for a machine learning method of this type (kriging), the mean prediction error should decrease asymptotically towards zero, with functional form $A + B/n$ or $C + D/\sqrt{n}$, where $n$ is the training set size, and $A$, $B$, $C$ and $D$ are fitted constants. Figure 4.5 plots these asymptotes, where ($A_{old} = 0.105$; $B_{old} = 348.11$) and ($C_{old} = -0.293$; $D_{old} = 22.06$), as determined by regression analysis against the “Old ferebus” data, each with $R^2$ coefficients of 0.93. Similarly, for the “New ferebus” data, constants of ($A_{new} = 0.097$; $B_{new} = 293.38$) and ($C_{new} = -0.193$; $D_{new} = 18.64$) were obtained. These fitted asymptotes both possess $R^2$ values of 0.98. As such, we conclude that the functional form of the decay of the mean prediction error of our machine learning method is, as yet, inconclusive.

The results in Figure 4.5 are consistent with the behaviour seen in similar interpolation methods[315]: for an infinite training set size, the mean prediction error will asymptote to zero. However, for the methodology to remain computationally feasible, some finite training set size will of course be required. So, for example, we find that for a mean prediction error of 0.3 kJ mol−1, the training set would require about 1450 sample points for either functional form taken as the decay of the prediction error, i.e. $A + B/n$ or $C + D/\sqrt{n}$.
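The training set size quoted above can be checked by inverting the two fitted forms. The following is a short sketch using the “New ferebus” constants reported in the text; the function names are hypothetical and introduced only for this illustration.

```python
# Fitted constants quoted in the text for the "New ferebus" data.
A_NEW, B_NEW = 0.097, 293.38
C_NEW, D_NEW = -0.193, 18.64


def size_for_inverse_n(target, a, b):
    """Training set size n at which a + b/n equals the target error."""
    return b / (target - a)


def size_for_inverse_sqrt_n(target, c, d):
    """Training set size n at which c + d/sqrt(n) equals the target error."""
    return (d / (target - c)) ** 2


TARGET = 0.3  # target mean prediction error in kJ mol^-1
n_inverse = size_for_inverse_n(TARGET, A_NEW, B_NEW)
n_inverse_sqrt = size_for_inverse_sqrt_n(TARGET, C_NEW, D_NEW)
print(round(n_inverse), round(n_inverse_sqrt))
```

Both forms return a training set size between 1400 and 1500 points, in line with the estimate of about 1450 given above.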

A comment on the nature of the average error is in order here. In principle, the prediction error consists of the sum of the estimation error and the approximation error. From learning theory, one expects only the estimation error to go to zero. The bias–variance decomposition of a learning algorithm’s error also contains a quantity called the irreducible error, resulting from noise in the problem itself. This error has been investigated some time ago in the context of tests on kriging of ethanol multipole moments[208] and is caused by the small noise generated by the integration quadrature of the atomic multipole moments. Secondly, any bias caused by an inherent error in the ab initio method used, compared to the best method available (e.g. CCSD(T) with a complete basis set), is not relevant in our error considerations. The reason is that we always assess the performance of kriging training against the (inevitably approximate) ab initio method at hand, which we refer to as the source of “the” ab initio data.

Figure 4.5: Mean prediction errors associated with CDEs for erythrose open chain as the training set size is increased. With the old kriging engine (left), a distinct plateau formed after roughly 1200 training examples, corresponding to no further improvement in the kriging model despite additional training points. However, the new kriging engine (right) appears to avoid premature plateauing, with additional kriging model improvement at higher training set sizes. Regression fits of the ferebus errors against training set size, of functional forms $A + B/n$ (blue) and $C + D/\sqrt{n}$ (black), where $n$ is the training set size, are also given.

4.4.3 Multiple Minima

The amount of conformational space available to molecular systems reaches levels which are entirely unfeasible for systematic sampling as the number of atoms increases. As such, it becomes all the more prudent to obtain an efficient sampling scheme for our purposes. As we have mentioned, our sampling methodology is limited to local conformational exploration about some given input geometry, since the PES about that point is approximated as a harmonic well. As such, to thoroughly explore conformational space, our methodology requires the usage of a number of such starting geometries. Then, the molecular PES is approximated by a number of harmonic wells. If the input geometries are sufficiently close to one another, the wells will overlap, and the PES may be explored seamlessly.

For the open chain form of erythrose, 174 energetic minima were found by an exhaustive search of conformational space. Figure 4.6 plots the CDEs obtained for samples which have been generated from different numbers of up to 99 minima. The CDEs display increasingly poor prediction results as the number of starting minima increases. The actual mean errors for these CDEs are summarised in Table 4.2. This trend has a logical interpretation. As the number of seeding structures increases, the sampled conformational space grows in size. Given a fixed kriging model size, the sampling density therefore decreases. The kriging model then deviates from the true analytical function, and the results from predictions deteriorate.

Of course, thorough sampling of conformational space is an issue for parameterising any force field, and by no means one that results from our methodology. We may overcome this issue in two ways. The first is the ongoing improvement of our kriging engine to deal with larger training sets comprising more molecular configurations. The second is by undertaking sampling with only a subset of the energetic minima that are available. This is all the more valid an approach if most of the minima are very high in energy relative to the lowest-lying minima. These regions of conformational space will be accessed very infrequently during the course of an MD simulation, and so may be sampled much more coarsely. This selective sampling is quite readily employed, and has been discussed at length in the literature. For example, Brooks and Karplus[232] found that a comprehensive sampling of conformational space for bovine pancreatic trypsin inhibitor could be achieved by evolving only the lowest frequency normal modes of motion. Needless to say, this is readily accomplished by our sampling methodology.

Figure 4.6: S-curves depicting the power of a kriging model as more energetic minima are utilised for conformational sampling. As the number of minima utilised increases, the CDEs tend towards higher prediction errors. The kriging models which underlie these CDEs have a fixed training set size of 700.

Number of Minima                    1      20     40     60     80     99
Mean Prediction Error (kJ mol−1)    0.27   0.85   0.9    1.23   1.30   1.37

Table 4.2: Mean errors corresponding to the CDEs depicted in Figure 4.6. Note the ‘bunching’ of prediction error when 60 or more energetic minima are used as seeds for the conformational sampling.

4.5 Conclusion

We have demonstrated that the atomic multipole moments of a set of carbohydrates are amenable to the machine learning technique kriging. Whilst this has been done in the past for a variety of chemical species including naturally occurring amino acids, this is the first foray into the field of glycobiology.

Kriging is able to capture the conformational dependence of the multipole moments and make predictions, such that the error in the electrostatic energy relative to that derived from ab initio data is encouraging, given that the popular aim is to obtain errors below 4 kJ mol−1. Indeed, the presented methodology is immediately extensible to any term arising in an energetic decomposition of a system. If some quantity is conformationally dependent, then the dependence can be modelled by kriging.

As such, an entire force field can be parameterised by the current methodology, reproducing ab initio quantities for use in classical MD. This route is preferable to the computationally intensive approach of ab initio MD.

Chapter 5

Subset Selection for Machine Learning

5.1 Introduction

In Appendix B, we give an estimate for the memory requirements of a biomolecular force field based upon the methodology we have outlined. A number of factors are largely out of our control, such as the number of atom types required. However, the size of the training set used to construct a kriging model is within our control, and can be minimised given a scheme for prudent training set construction.

As we have mentioned in Section 2.4.2, the errors associated with a kriging model diminish as we increase the size of the training set. However, we are limited in our ability to improve the accuracy of a kriging model by systematically increasing the training set size because: (1) the construction of a kriging model is an $O(N^3)$ process, and the memory requirements scale as approximately $O(N^2)$; (2) the clustering of points in a training set is associated with an ill-conditioning of R; and (3) undulant regions of the response function require denser sampling to capture complex topological features.

With regard to (1), we obviously require that the training set is as small as possible to make the resultant kriging model viable. Regarding (2), we require that training points are as mutually distant from one another as possible to prevent clustering and subsequent ill-conditioning of R. Finally, concerning (3), some regions of conformational space require a higher density of sampling to capture the undulant topology of the response function.
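Condition (2) can be made concrete with the smallest possible example. The sketch below assumes a correlation function of the form exp(−θ d^p) between two points separated by a distance d; the function name and parameter values are hypothetical and serve only to show how the condition number of R grows as two training points approach one another.

```python
import math


def condition_number_two_points(distance, theta=1.0, p=2.0):
    """Condition number of the 2x2 correlation matrix [[1, r], [r, 1]].

    Assumes r = exp(-theta * distance**p) and distance > 0.
    """
    r = math.exp(-theta * distance ** p)
    # The eigenvalues of [[1, r], [r, 1]] are 1 + r and 1 - r.
    return (1.0 + r) / (1.0 - r)


well_separated = condition_number_two_points(1.0)
clustered = condition_number_two_points(1e-3)
```

For well-separated points the matrix is close to the identity, whereas for near-coincident points the two rows of R become almost identical and the smaller eigenvalue approaches zero.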

It is virtually impossible to satisfy each of the three conditions without a detailed knowledge of the response function we are trying to model. Unfortunately, the response function is a PES, and detailed knowledge is simply not available for all but the simplest of molecular systems. We will present a scheme in Section 5.3 that samples the response function at a great many points in an attempt to gain some information regarding the topology of the response function. Whilst this methodology possesses a number of caveats, it is the closest we can come to concurrently satisfying all conditions.

The methodologies of Sections 5.2 and 5.4 prioritise conditions (1) and (2) while disregarding (3) entirely for the sake of increased computational tractability. Finally, in Section 5.5, we address a separate issue of ensuring the kriging model is always used in an interpolative fashion, since the extrapolative capacity of a kriging model is severely limited.

5.2 Greedy Heuristics

Perhaps the most straightforward methodology available for optimal subset selection is the greedy algorithm. A greedy algorithm parses a large number of points, the candidate set, Ω = {x1, ..., xN}, into a subset, S. S is constructed iteratively, such that points are transferred from the candidate set to the subset based on some selection function, one at a time, until |S| = n, where n < N. The premise of a greedy algorithm is that a globally optimal subset can be constructed by iterating a locally optimal selection function.

The operation of a greedy algorithm is best demonstrated in the context of graph theory. Consider a set of connected nodes about which we wish to construct the shortest possible Hamiltonian circuit[316]. The connections between nodes, or edges, possess an associated length, and it is these lengths that characterise the length of the Hamiltonian circuit. If each node is connected to every other node, then every solution is an optimal solution, and so we deal explicitly with the case where each node is only connected to a subset of the remaining nodes. We also assume that each edge links two distinct nodes, i.e. a node cannot be connected to itself. Subject to these conditions, the construction of the shortest possible Hamiltonian circuit is referred to as the travelling salesman problem.

Given an origin node, a greedy algorithm will invoke its associated selection function and find the closest linked node, subject to the constraint that no nodes which have been previously “visited” can be used, a characteristic of a Hamiltonian circuit. The closest node then acts as an origin node and the process can be iterated until we have a closed path about all nodes. It is not too difficult to find systems where the greedy algorithm fails to find an optimal Hamiltonian circuit. However, greedy algorithms are not used for their ability to find the optimal solution. Greedy algorithms are simple and algorithmically efficient, making them ideal candidates for approximate optimal subset selection on large, complex datasets, where systematic evaluation of each solution is not feasible.
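The greedy construction of a Hamiltonian circuit described above can be sketched as follows. The five-node graph, its edge lengths and all function names are hypothetical, chosen so that each node is linked to only a subset of the others and so that the greedy circuit is longer than the shortest circuit found by exhaustive enumeration.

```python
from itertools import permutations


def build_graph(weighted_edges):
    """Undirected graph as {node: {neighbour: edge length}}."""
    graph = {}
    for a, b, w in weighted_edges:
        graph.setdefault(a, {})[b] = w
        graph.setdefault(b, {})[a] = w
    return graph


def greedy_circuit(edges, origin):
    """Nearest-neighbour circuit; returns (path, length) or None if stuck."""
    path, length, current = [origin], 0.0, origin
    while len(path) < len(edges):
        # Locally optimal choice: the closest linked node not yet visited.
        options = {v: w for v, w in edges[current].items() if v not in path}
        if not options:
            return None  # dead end before all nodes were visited
        nxt = min(options, key=options.get)
        length += options[nxt]
        path.append(nxt)
        current = nxt
    if origin not in edges[current]:
        return None  # the path cannot be closed into a circuit
    return path + [origin], length + edges[current][origin]


def shortest_circuit_length(edges, origin):
    """Exhaustive search; only feasible for very small graphs."""
    others = [v for v in edges if v != origin]
    best = None
    for perm in permutations(others):
        tour = [origin, *perm, origin]
        if all(b in edges[a] for a, b in zip(tour, tour[1:])):
            total = float(sum(edges[a][b] for a, b in zip(tour, tour[1:])))
            best = total if best is None else min(best, total)
    return best


GRAPH = build_graph([("A", "B", 1), ("A", "C", 2), ("A", "E", 4),
                     ("B", "C", 2), ("B", "D", 3), ("C", "D", 10),
                     ("D", "E", 1)])
greedy_path, greedy_length = greedy_circuit(GRAPH, "A")
optimal_length = shortest_circuit_length(GRAPH, "A")
```

Starting from A, the greedy walk commits to the short edges A-B and B-C and is then forced along the long C-D edge, whereas the exhaustive search avoids that edge altogether.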

For the construction of a maximally diverse subset S selected from the candidate set Ω, we invoke a variant of the travelling salesman problem, where instead of invoking a selection function to minimise the distance between nodes, we maximise the distance between nodes. The corresponding nodes are then added to S and removed from Ω. This process is iterated until |S| = n.

For our purposes, the candidate set will comprise some large pool of molecular conformations. The subset that is generated by the greedy algorithm is then the training set (nomenclature that we adopt herein), with which a kriging model can be built and validated on an external test set, not contained within the candidate set. The choice of distance metric is arbitrary, but for our purposes we initially choose the Euclidean norm. This choice is revisited in Section 5.2.4.

Regarding our expectations, a greedy algorithm should reduce the clustering of points in the training set. We anticipate the following improvements in our kriging models:

1. Dispersion: a more disperse distribution of training points through the training set, making it less probable that predictions will be made in under-represented areas of conformational space.

2. Uncluttering: inhibition of the clustering of training points in the training set, improving the global predictive capacity of the kriging model.

From (1), we expect lower maximum errors attributed to a kriging model since the number of predictions in poorly modelled regions of conformational space is reduced. However, one could also posit local regions of conformational space to be less well-represented by the training set, thereby increasing the average error associated with a kriging model. From (2), we anticipate the average error associated with a kriging model will decrease since the kriging model will not be overtrained in certain regions of conformational space. Based on this qualitative analysis, we anticipate an overall decrease in the maximum error associated with a kriging model, but reserve judgement on the average error, since (1) and (2) appear to yield opposing effects. Whether the effect of one will outweigh the other is unknown.

We are unsure of the effects of dispersion and uncluttering on the MSE of prediction points. Since the MSE is a measure of whether the training density is appropriate in the vicinity of a prediction point, reducing the sampling density should technically increase the MSE of prediction points. However, since we anticipate that both dispersion and uncluttering will improve our kriging models, it is perhaps reasonable to assume that the MSE will also decrease at prediction points.

5.2.1 Experimental Details

We evaluate the performance of the greedy algorithms on a number of small molecule test systems: water, methanol, N-methyl acetamide (NMA) and glycine. Each system has a candidate set of cardinality |Ω| = 3500, with the exception of water, where |Ω| = 1000 owing to its limited number of conformational degrees of freedom. We have constructed a series of training sets from these candidate sets by the greedy algorithms, beginning at a training set size of 100 and incrementing this by 100 up to a training set size of 1000. Each of the training sets is then used to construct a kriging model, and its performance evaluated on an external testing set. The external testing set for each system comprises 500 distinct conformations not contained within Ω. We benchmark each of the kriging models against a kriging model constructed from 1000 random samples of the candidate set, which has thus far been the conventional methodology adopted by our group in the construction of training sets.

To test the greedy subset selection algorithms, we require some function with which to conduct error analysis. For each conformation within the testing set, we predict the $\sum_B V_{ee,\mathrm{exch}}^{AB}$, $\sum_B V_{ee,\mathrm{coul}}^{AB}$ and $V_{en}^{AA}$ IQA components of each atom within the molecule. With these components, we can define the total prediction error corresponding to a test conformation, $E(x)$, by

$$E(x) = \left| \Delta \sum_B V_{ee,\mathrm{exch}}^{AB}(x) \right| + \left| \Delta \sum_B V_{ee,\mathrm{coul}}^{AB}(x) \right| + \left| \Delta V_{en}^{AA}(x) \right| , \tag{5.2.1}$$

where the $\Delta$ implies that we take the difference between the actual and predicted values of the quantity it precedes, i.e. the error associated with that quantity. Taking the absolute value of the error in each IQA component ensures there is no cancellation of error between terms, thus making our error analysis as rigorous as possible.
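The effect of taking absolute values in Equation (5.2.1) can be seen in a short sketch. The dictionary keys, function name and numerical values below are hypothetical; each entry stands for one of the three IQA components, with the sums over B assumed to have been carried out already.

```python
COMPONENTS = ("vee_exch", "vee_coul", "ven")  # hypothetical labels


def prediction_error(actual, predicted):
    """Sum of absolute component errors, following Equation (5.2.1)."""
    return sum(abs(actual[c] - predicted[c]) for c in COMPONENTS)


def signed_error(actual, predicted):
    """Naive signed sum, shown only to illustrate error cancellation."""
    return sum(actual[c] - predicted[c] for c in COMPONENTS)


# Illustrative values only (arbitrary units), not taken from the thesis.
actual = {"vee_exch": -10.0, "vee_coul": 5.0, "ven": -100.0}
predicted = {"vee_exch": -10.5, "vee_coul": 5.5, "ven": -100.0}

e_abs = prediction_error(actual, predicted)
e_signed = signed_error(actual, predicted)
```

Here the exchange and Coulomb errors are equal and opposite, so the signed sum vanishes even though two components are mispredicted, whereas the absolute sum reports the full error.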

All samples have been obtained using the tyche conformational sampling software, described in Section 3. All ab initio calculations were performed with the gaussian09 software package, using the B3LYP/6-31G(d,p) level of theory. IQA decomposition was carried out with aimall and kriging models were obtained through ferebus.

5.2.2 MaxMin

The MaxMin algorithm judges an optimal subset to be one that maximises the geometric distance between points. For this purpose, the selection function is a distance metric between points, d(xi, xj), and points are parsed from the candidate set to the subset in a manner which makes the subset as sparse as possible[317]. The MaxMin algorithm, then, should avoid the ill-conditioning of the R matrix as a result of constituent points being too close[196]. The steps of the algorithm are outlined in Algorithm 2[318].

1. Take S = ∅
2. Take xi and xj, where xi, xj ∈ Ω, such that d(xi, xj) is maximal
3. Add xi and xj to S
while |S| < n do
    Find xk ∈ Ω\S, such that min_{xi∈S} d(xi, xk) is maximum among Ω\S
    Add xk to S
end

Algorithm 2: MaxMin Algorithm
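The steps of the MaxMin algorithm can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the implementation used to generate our results: it uses the Euclidean distance, a brute-force search over all pairs, a hypothetical function name, and resolves ties by taking the lowest index.

```python
import math
from itertools import combinations


def maxmin(candidates, n, d=math.dist):
    """Greedy MaxMin selection; returns indices into `candidates`."""
    if not 2 <= n <= len(candidates):
        raise ValueError("n must lie between 2 and the candidate set size")
    # Steps 1-3: seed the subset with the most distant pair of candidates.
    i, j = max(combinations(range(len(candidates)), 2),
               key=lambda pair: d(candidates[pair[0]], candidates[pair[1]]))
    selected = [i, j]
    remaining = set(range(len(candidates))) - {i, j}
    while len(selected) < n:
        # Choose the candidate whose nearest selected point is furthest away.
        k = max(sorted(remaining),
                key=lambda c: min(d(candidates[c], candidates[s])
                                  for s in selected))
        selected.append(k)
        remaining.remove(k)
    return selected


# Hypothetical one-dimensional candidate set embedded in the plane.
points = [(0.0, 0.0), (1.0, 0.0), (1.1, 0.0), (5.0, 0.0), (10.0, 0.0)]
subset = maxmin(points, 3)
```

For this candidate set the two extreme points are selected first, followed by the point at x = 5, which is furthest from both.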

In Figure 5.1, we present the average errors associated with the prediction of the testing set for each small molecule as the training set size is increased. For each small molecule studied, we see that as the training set size is increased, the average prediction error converges almost exactly to the average prediction error obtained from the random benchmark sampling. There is, therefore, no discernible benefit associated with constructing a training set with the MaxMin algorithm in terms of average errors. We present the maximum errors associated with each kriging model in Table 5.1.

Figure 5.1: Average molecular energy error, with accompanying standard deviations, in the testing set conformations given a training set constructed by the MaxMin algorithm from the respective candidate sets at a number of training set sizes. In each pane, the dark black line corresponds to the average molecular energy error in the testing set given a training set of size 1000, generated randomly from the candidate set.

Maximum Prediction Error (kJ mol−1)

Training Set Size    Water     Methanol    NMA       Glycine
100                  9.563     37.596      50.790    47.116
200                  10.436    12.182      22.858    37.353
300                  5.598     10.951      21.368    36.196
400                  6.258     10.820      20.076    38.252
500                  4.537     13.231      19.582    39.373
600                  5.578     8.508       17.738    41.471
700                  5.666     8.908       16.704    39.480
800                  5.081     7.679       15.728    36.916
900                  4.297     7.558       14.846    33.338
1000                 4.495     8.256       15.796    32.683
Random               4.488     7.213       14.993    15.061

Table 5.1: Maximum prediction errors for each small molecule kriging model as the training set size is increased. Maximum prediction errors for the randomly constructed training sets are presented in the final row.

While there are two examples where the maximum prediction error of a MaxMin constructed training set is lower than that of the random benchmark (water and NMA with training set sizes of 900), for the most part we can conclude that the MaxMin algorithm makes no consistent improvement on random subset selection. Indeed, it seems more likely that the MaxMin algorithm constructs training sets that are worse than the random benchmarks based on the maximum errors, particularly for glycine, where the maximum error is double that of the randomly constructed training set.

5.2.3 Deletion

Whilst similar in nature to the MaxMin algorithm, the greedy Deletion algorithm judges an optimal subset to be one which minimises the amount of geometric clustering between points in the subset. To achieve this, the greedy Deletion algorithm iteratively removes points from the candidate set by rejecting a point at each iteration that is closest to any other point in the candidate set, as outlined in Algorithm 3.

1. Take S = Ω
while |S| > n do
    Take xi and xj, where xi, xj ∈ S, such that d(xi, xj) is minimal
    Determine ci = min_{k∈S\{i,j}} d(xi, xk) and cj = min_{k∈S\{i,j}} d(xj, xk)
    Remove i from S if ci < cj, or j from S if cj < ci
end

Algorithm 3: Deletion Algorithm
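The Deletion algorithm admits an equally short sketch. As before, this is an illustration under stated assumptions and not the implementation used to generate our results: it uses the Euclidean distance, a brute-force search for the closest pair, and a hypothetical function name. The algorithm as written does not specify the case ci = cj, so the sketch arbitrarily removes i in that case.

```python
import math
from itertools import combinations


def deletion(candidates, n, d=math.dist):
    """Greedy Deletion; returns the indices of the n retained candidates."""
    if not 2 <= n <= len(candidates):
        raise ValueError("n must lie between 2 and the candidate set size")
    kept = list(range(len(candidates)))
    while len(kept) > n:
        # Closest pair among the points still retained.
        i, j = min(combinations(kept, 2),
                   key=lambda pair: d(candidates[pair[0]], candidates[pair[1]]))
        others = [k for k in kept if k not in (i, j)]
        # Distance from each member of the pair to its nearest other point.
        ci = min(d(candidates[i], candidates[k]) for k in others)
        cj = min(d(candidates[j], candidates[k]) for k in others)
        # Remove the more crowded member (assumption: ties remove i).
        kept.remove(i if ci <= cj else j)
    return kept


# Hypothetical one-dimensional candidate set embedded in the plane.
points = [(0.0, 0.0), (1.0, 0.0), (1.1, 0.0), (5.0, 0.0), (10.0, 0.0)]
subset = deletion(points, 3)
```

For this candidate set the two clustered points near x = 1 are removed in turn, leaving the points at x = 0, 5 and 10.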

Qualitatively, it seems as if both the MaxMin and Deletion algorithms will yield the same subset by approaching the problem from different ends. Whilst the MaxMin algorithm builds a subset from the “bottom up” (i.e. adding points that are furthest from all other points), the Deletion algorithm builds a subset from the “top down” (i.e. removing points that are closest to all other points). However, the selection function is the same for both, so one may be forgiven for assuming the two algorithms to converge upon the same subset. We can demonstrate a simple example where the two give differing subsets in Table 5.2.

[Table 5.2 is graphical. Rows: Greedy MaxMin, Greedy Deletion. Columns: Starting Problem, Iteration 1, Iteration 2, Iteration 3, Solution.]

Table 5.2: Comparison of the greedy MaxMin and greedy Deletion algorithms on a candidate set in $\mathbb{R}^2$, with the aim of constructing a subset, where |S| = 4. The first row, corresponding to the steps taken by the MaxMin algorithm, adds two points (red) at each iteration to the subset, whose distance is the largest in the entire candidate set. The second row corresponds to the steps taken by the Deletion algorithm. Here, the two points in the candidate set which are closest together are selected (green). Their nearest neighbours (orange) are then evaluated, and the point with the closest nearest neighbour is removed from the candidate set.

We see that both greedy algorithms add the points A and B to the subset, as would be expected if our aim is to construct a geometrically diverse subset. However, the two greedy algorithms differ when selecting points from the cluster, (CDEFG), in the candidate set. The MaxMin algorithm selects two points which qualitatively seem to be the closest two in the cluster, F and G, whilst the Deletion algorithm selects the two points that appear to be maximally separated, D and F. Of course, this example is cherry-picked to demonstrate the difference between the two greedy algorithms, and we do not mean to imply that the Deletion algorithm gives a more geometrically diverse subset than the MaxMin algorithm. Counter examples can be constructed which favour MaxMin over Deletion.

In Figure 5.2, we present the average error in the total molecular energy for the same 500 external test cases that were used in Section 5.2.2 as we increase the size of the training set constructed by the greedy Deletion algorithm.

The results in Figure 5.2 are very similar to those presented for the MaxMin algorithm in Figure 5.1. If anything, the greedy Deletion algorithm appears to be slightly outperformed by the greedy MaxMin algorithm owing to the slower rate of convergence to the random benchmark in both the water and methanol systems. We also present the maximum errors associated with each kriging model in Table 5.3.

Figure 5.2: Average molecular energy error, with accompanying standard deviations, in the testing set conformations given a training set constructed by the Deletion algorithm from the respective candidate sets at a number of training set sizes. In each pane, the dark black line corresponds to the average molecular energy error in the testing set given a training set of size 1000, generated randomly from the candidate set.

Maximum Prediction Error (kJ mol−1)

Training Set Size    Water     Methanol    NMA       Glycine
100                  11.716    21.181      54.513    47.916
200                  7.692     17.110      32.058    42.534
300                  7.001     16.342      21.750    39.598
400                  7.676     13.813      22.212    50.454
500                  5.589     15.473      20.631    39.508
600                  5.413     14.992      18.562    38.762
700                  5.372     11.325      17.696    36.581
800                  5.335     12.976      16.319    35.033
900                  4.247     14.498      14.305    30.740
1000                 4.486     16.217      15.043    30.707
Random               4.488     7.213       14.993    15.061

Table 5.3: Maximum prediction errors for each small molecule kriging model as the training set size is increased. Maximum prediction errors for the randomly constructed training sets are presented in the final row.

Of the maximum errors in Table 5.3, those of the Deletion methanol model are particularly poor. The training sets constructed for methanol have maximum errors that are virtually double those of the MaxMin training sets. In saying this, the maximum error attributed to the Deletion glycine model is roughly 2 kJ mol−1 lower than the equivalent MaxMin model. The prevailing conclusion is that the maximum errors from both greedy algorithms are significantly larger than the random benchmarks in most cases. The large maximum errors attributed to the greedy algorithms are particularly strange, and we have struggled to explain them. Even given the alterations we will make in the next section to the greedy algorithms, it is quite difficult to reason why, in their current incarnations, they perform so poorly relative to the random methodology.

5.2.4 Alternative Distance Metric

Before dismissing the greedy heuristics, we need to remedy a conceptual flaw with the above implementations. In both the MaxMin and Deletion schemes, points are added to or removed from the subset based on a geometric criterion, which in effect attempts to maximise the Euclidean distance between all points within the resultant subset.

Kriging makes a prediction of the response function at untrained points based on the value of the response function at trained points, weighted by the distance between the trained point and the prediction point. So, training points that are further away from the prediction point contribute less to the prediction of the response function at the prediction point, while training points that are closer contribute more. This predictive methodology is captured by the covariance between points, i.e. the kriging kernel, and is governed by Equation (2.4.1).

Equation (2.4.1) scales the $L^1$-norm, or taxicab metric, in the $k$th degree of freedom by $\theta_k$. This scaled $L^1$-norm is then exponentiated in the $k$th degree of freedom by $p_k$. The exponential of the negative sum of these “scaled-exponentiated” values is then taken over all degrees of freedom.

It stands to reason that the covariance between two points has some informal role as a distance metric. It resembles an $L^1$-norm on a scaled Gaussian space in the limit that $p_k = 2\ \forall k = 1, \ldots, d$, a scaled exponential space in the limit that $p_k = 1\ \forall k = 1, \ldots, d$, and some scaled intermediary space otherwise. At the very least, without wishing to present a rigorous proof that the covariance and distance between points are correlated, we posit that adopting the Euclidean norm as a selection function is not appropriate.
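The difference between a Euclidean and a covariance-based selection function can be illustrated with a short sketch. Two assumptions are made loudly here: the correlation is taken to have the commonly used form exp(−Σk θk |xi,k − xj,k|^pk), whose exact arrangement should be checked against Equation (2.4.1), and the conversion of a correlation into a dissimilarity as one minus the correlation is a hypothetical choice made for this illustration only. The function names and parameter values are likewise hypothetical.

```python
import math


def kriging_correlation(xi, xj, theta, p):
    """Assumed kernel: exp(-sum_k theta_k * |xi_k - xj_k| ** p_k)."""
    return math.exp(-sum(t * abs(a - b) ** q
                         for a, b, t, q in zip(xi, xj, theta, p)))


def covariance_distance(xi, xj, theta, p):
    """Hypothetical dissimilarity: 0 for identical points, near 1 if uncorrelated."""
    return 1.0 - kriging_correlation(xi, xj, theta, p)


# Two displacements of equal Euclidean length from the origin, taken along
# degrees of freedom with very different (illustrative) theta values.
theta, p = (10.0, 0.1), (2.0, 2.0)
origin = (0.0, 0.0)
d_stiff = covariance_distance(origin, (1.0, 0.0), theta, p)
d_soft = covariance_distance(origin, (0.0, 1.0), theta, p)
```

Both displaced points are a Euclidean distance of one from the origin, yet the point displaced along the degree of freedom with the large θ is almost uncorrelated with the origin, while the other remains strongly correlated with it.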

An issue immediately arises in adapting the MaxMin and Deletion algorithms for the covariance space. We require a priori knowledge of $\{\hat{\theta}, \hat{p}\}$, the solutions that maximise the log-likelihood function of Equation (2.4.9), to evaluate the correlation between points. This information is obviously not available until a kriging model has been constructed from a training set. A potential workaround for this would be to approximate $\{\hat{\theta}, \hat{p}\}$ using a small training set. We could subsequently use these approximations to approximate the covariance between points.

We have attempted to ascertain whether such an approximation is viable. We have evaluated the evolution of θ as the training set size increases for each IQA component, for each atom in the water molecule. We have omitted the evolution of p from our analysis since its value remains virtually constant as the training set size increases. We have qualitatively established that the values of θ stabilise when the training set size reaches roughly 700. We satisfy ourselves with qualitative convergence since we only mean to approximate the hyperparameters, and so we do not attempt to invoke any numerical measure for convergence. We have verified that this qualitative convergence is exhibited for methanol, but it becomes increasingly difficult to analyse the data in higher-dimensional spaces. As such, we have not undertaken any equivalent analysis for NMA or glycine.

We have revised the greedy algorithms of Sections 5.2.2 and 5.2.3 to utilise the approximate kriging hyperparameters, obtained from kriging models of a randomly generated training set for each small molecule. Each training set comprises 1000 points, which is perhaps excessive given that we have established that convergence occurs at roughly 700 training points. The results for the MaxMin and Deletion algorithms with the adapted distance metric are given in Figures 5.3 and 5.4 and Tables 5.4 and 5.5.
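The adapted MaxMin selection can be sketched as follows. This assumes that the MaxMin algorithm of Section 5.2.2 takes the usual greedy form (repeatedly adding the candidate whose smallest dissimilarity to the current subset is largest); the function names, the seed choice and the `1 - correlation` dissimilarity are illustrative assumptions rather than the code used here.

```python
import math

def dissimilarity(x, y, theta, p):
    """Assumed covariance-based dissimilarity, 1 - exp(-sum theta|dx|^p)."""
    return 1.0 - math.exp(-sum(t * abs(a - b) ** q
                               for a, b, t, q in zip(x, y, theta, p)))

def maxmin_select(candidates, n_select, theta, p, seed_index=0):
    """Greedy MaxMin: grow the subset by the candidate that is least
    similar to its closest already-selected point.
    Assumes n_select <= len(candidates)."""
    selected = [seed_index]
    # min_d[i]: dissimilarity between candidate i and its closest selected point
    min_d = [dissimilarity(c, candidates[seed_index], theta, p)
             for c in candidates]
    while len(selected) < n_select:
        remaining = (i for i in range(len(candidates)) if i not in selected)
        best = max(remaining, key=lambda i: min_d[i])
        selected.append(best)
        # Only the newly added point can lower the stored minima.
        for i, c in enumerate(candidates):
            min_d[i] = min(min_d[i], dissimilarity(c, candidates[best], theta, p))
    return selected

# Toy candidate set; theta and p stand in for the approximate hyperparameters.
cands = [(0.0, 0.0), (0.1, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
subset = maxmin_select(cands, 3, theta=[1.0, 1.0], p=[2.0, 2.0])
```

Caching the minimum dissimilarity per candidate means each iteration costs one pass over the candidate set rather than a pass over every candidate and selected pair.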

Figure 5.3: Average molecular energy error, with accompanying standard deviations, in the testing set conformations given a training set constructed by the MaxMin algorithm utilising the kriging covariance as a distance metric. In each pane, the dark black line corresponds to the average molecular energy error in the testing set given a training set of size 1000, generated randomly from the candidate set.

Figure 5.4: Average molecular energy error, with accompanying standard deviations, in the testing set conformations given a training set constructed by the Deletion algorithm utilising the kriging covariance as a distance metric. In each pane, the dark black line corresponds to the average molecular energy error in the testing set given a training set of size 1000, generated randomly from the candidate set.

Maximum Prediction Error (kJ mol−1)

Training Set Size    Water     Methanol    NMA       Glycine
100                  9.961     20.895      40.027    38.391
200                  6.632     8.982       24.727    21.323
300                  7.292     9.465       22.908    14.215
400                  7.904     8.171       21.216    15.543
500                  6.995     7.514       20.131    15.056
600                  4.909     8.182       18.683    14.180
700                  4.919     7.015       16.023    14.450
800                  4.351     6.425       16.208    13.331
900                  4.058     6.733       15.426    14.227
1000                 4.490     8.190       16.396    13.979
Random               4.488     7.213       14.993    15.061

Table 5.4: MaxMin-constructed training set maximum prediction errors for each small molecule kriging model as the training set size is increased. Maximum prediction errors for the randomly constructed training sets are presented in the final row.

Maximum Prediction Error (kJ mol−1)

Training Set Size    Water     Methanol    NMA       Glycine
100                  13.119    18.764      45.476    25.421
200                  7.227     13.838      26.695    18.256
300                  7.256     7.892       21.817    12.393
400                  7.636     8.440       16.468    19.797
500                  5.392     6.744       16.781    13.618
600                  5.730     6.472       16.532    13.306
700                  5.586     6.624       16.566    14.537
800                  5.007     7.427       16.242    13.738
900                  4.039     7.439       14.632    13.193
1000                 4.484     7.095       13.061    13.132
Random               4.488     7.213       14.993    15.061

Table 5.5: Deletion-constructed training set maximum prediction errors for each small molecule kriging model as the training set size is increased. Maximum prediction errors for the randomly constructed training sets are presented in the final row.

Based on the average errors attributed to each of the kriging models obtained from the MaxMin and Deletion algorithms, at least one of the two algorithms yields a training set that outperforms the random benchmark training set for every molecule. For water, both the MaxMin and Deletion algorithms construct training sets that outperform the random benchmark by a training set size of 800 points. For methanol, the MaxMin algorithm outperforms the random benchmark by roughly 0.3 kJ mol−1 at a training set size of 700, but the training set constructed by the Deletion algorithm fails to improve upon the random training set. The situation is reversed for NMA, where the Deletion algorithm constructs a training set that outperforms the random training set by roughly 0.1 kJ mol−1 at a training set size of 1000, whereas the MaxMin training sets are outperformed by the random training set. Finally, for glycine, we find that both the MaxMin and Deletion training sets outperform the random training set by roughly 0.1 kJ mol−1.

In terms of implementing these strategies, our mix of results proves problematic. Ideally, we would like to implement a subset selection technique that consistently improves upon the randomly constructed training sets. Whilst both of the greedy algorithms can outperform the random subset selection, both can also be outperformed by the random training set selection methodology.

If we are to qualify our subset selection strategy by the maximum error attributed to the model, then we see from Tables 5.4 and 5.5 that the Deletion algorithm systematically outperforms the random training set construction by a significant amount, almost 2 kJ mol−1 for NMA and glycine. MaxMin, on the other hand, is not quite as successful, being outperformed in this category for methanol and NMA. As such, we have tentatively concluded that the greedy Deletion algorithm is a preferable candidate for subset selection, since it commonly outperforms the random subset selection in terms of average error, and systematically outperforms the random subset selection in terms of maximum error.

A number of problems can be attributed to the implementation of the greedy Deletion algorithm. We note that neither the maximum error nor the average error of a kriging model decreases monotonically with training set size. There seems to be no way in which we can determine, a priori, an ideal compromise between minimising the size of the training set and the error associated with the resultant kriging model. For example, the optimal training set in terms of average error occurs at training set sizes of 800, 700, 1000 and 1000 for water, methanol, NMA and glycine, respectively. In contrast, the optimal training set in terms of maximum prediction error occurs at training set sizes of 900, 600, 1000 and 1000. One could systematically increase the training set size until the errors associated with some testing set begin to increase, but the construction of numerous kriging models is time-consuming.

We imagine that as the size of the candidate set increases, there is more variety with which to construct a training set. Therefore, a larger candidate set should yield better training sets given an appropriate subset selection methodology. However, employing a “top-down” approach to subset selection is problematic for a large candidate set, as we can illustrate through example. Consider a candidate set consisting of 10,000 molecular conformations. If we wish to construct a training set comprising 1000 molecular conformations, the Deletion subset selection requires 9000 iterations, whereas MaxMin subset selection requires 1000 iterations. The additional computational effort required for Deletion over MaxMin is obviously amplified as the candidate set increases. One can then appreciate that a “bottom-up” approach to subset selection from a large candidate set is preferable.

5.3 Sequential Selection

The clustering of training points in conformational space negatively impacts on the quality of a kriging model, where the global predictive capacity of the kriging model is known to deteriorate. However, if the topology of the response function surface is particularly undulant in a certain area, more training points will be required in this area to capture its topological features. Therefore, a purely geometric criterion for subset selection, no matter how elaborate, can never be seen as an optimal means for subset selection for kriging. Ideally, one would possess information about the response function so that undulant regions can be represented by a higher training density. In other words, some variant of importance sampling is required to direct sampling to these difficult-to-model regions of conformational space.

The issue with this approach for our purposes is that information about the response function is only available after the computationally intensive ab initio calculations required to obtain the atomic properties being kriged. One method for circumventing this issue is by crudely computing the response function at a small number of points in conformational space, and evaluating which regions appear to be unsatisfactorily represented. For example, if two training points, x and x′, are deemed sufficiently close, i.e. d(x, x′) < dmax, we can impose some requirement that the response function is “sufficiently continuous”, i.e. |f(x) − f(x′)| < ε, where dmax and ε are suitably chosen parameters. If the two points fail to meet this criterion, then another point is added to the training set between x and x′ in some fashion.
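The “sufficiently continuous” test can be sketched as below. The function names, the Euclidean choice for d and the use of the midpoint as the added point are all assumptions made for illustration; the text leaves “in some fashion” open.

```python
import math
from itertools import combinations

def undersampled_pairs(points, values, d_max, eps):
    """Index pairs that are close (d < d_max) but whose response values
    differ by at least eps, i.e. pairs failing the continuity criterion."""
    flagged = []
    for i, j in combinations(range(len(points)), 2):
        close = math.dist(points[i], points[j]) < d_max
        if close and abs(values[i] - values[j]) >= eps:
            flagged.append((i, j))
    return flagged

def midpoint(x, y):
    """One possible new conformation 'between' x and y (an assumption)."""
    return tuple((a + b) / 2.0 for a, b in zip(x, y))

# Toy one-dimensional example with made-up response values.
points = [(0.0,), (0.5,), (2.0,)]
values = [0.0, 3.0, 3.1]
pairs = undersampled_pairs(points, values, d_max=1.0, eps=1.0)
proposals = [midpoint(points[i], points[j]) for i, j in pairs]
```

Each proposed conformation would then require its own ab initio calculation before the test could be repeated.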

The issue with the above methodology is that subset selection becomes an inherently serial task. A set of ab initio calculations is performed, and new conformations are determined, followed by a new set of ab initio calculations. This process is then iterated until the response function between all points is “sufficiently continuous”. There is no scope for a parallel workflow since each set of newly proposed conformations is explicitly dependent upon the previous set of ab initio calculations.

In addition, the choice of parameters dmax and ε is arbitrary, and presumably system-dependent.

As such, we adopt the scheme employed in the work of Rennen[319], that of Sequential Selection. With sequential selection, a large candidate set, Ω, is constructed, with the response function having been evaluated at each point. An initially small training set, S, is constructed from the candidate set, with |S| = n0.¹ Using S, a kriging model is constructed and the response function at each point in the candidate set but not in the training set, Ω\S, is predicted. Some subset of the worst predicted points in Ω\S is added to the training set, and the whole process is iterated until the training set is of adequate size, or all points are predicted with an error below some tolerance. The latter criterion can be formalised by defining the error of prediction for a conformation, xi, as E(xi) = |fpred(xi) − fact(xi)|, and requiring that E(xi) < E, where fpred(xi) and fact(xi) are the predicted and actual values of the response function at xi, respectively, and E is a pre-defined error tolerance. The sequential selection scheme is illustrated in Algorithm 4.

¹ The manner in which this small training set is constructed is irrelevant if it is sufficiently small.

1. Take S = ∅
2. Select n0 points from Ω and add them to S
while |S| < N and E(xi) ≥ E for some xi ∈ Ω\S do
    Construct a kriging model from S
    Calculate E(xi) = |fpred(xi) − fact(xi)| ∀xi ∈ Ω\S
    Add some subset of those points whose E(xi) is highest to S
end
Algorithm 4: Sequential Selection Algorithm
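The loop of Algorithm 4 can be sketched in a few lines. To keep the example self-contained, a nearest-neighbour predictor stands in for the kriging model; this substitution, the function names and the toy response function are assumptions for illustration only.

```python
import math

def fit_nearest_neighbour(xs, ys):
    """Stand-in for kriging: predict the response of the closest training point."""
    def predict(x):
        return min(zip(xs, ys), key=lambda pair: math.dist(pair[0], x))[1]
    return predict

def sequential_selection(candidates, responses, initial, n_max, tol,
                         n_add=1, fit=fit_nearest_neighbour):
    """Grow a training set (list of candidate indices) by repeatedly adding
    the n_add worst-predicted candidates, stopping at n_max points or when
    every remaining candidate is predicted to within tol."""
    selected = list(initial)
    while len(selected) < n_max:
        model = fit([candidates[i] for i in selected],
                    [responses[i] for i in selected])
        chosen = set(selected)
        errors = {i: abs(model(candidates[i]) - responses[i])
                  for i in range(len(candidates)) if i not in chosen}
        if not errors or max(errors.values()) < tol:
            break
        worst_first = sorted(errors, key=errors.get, reverse=True)
        selected.extend(worst_first[:min(n_add, n_max - len(selected))])
    return selected

# Toy candidate set: f(x) = x**2 sampled on a grid in [0, 1].
grid = [(i / 10.0,) for i in range(11)]
f = [x[0] ** 2 for x in grid]
training = sequential_selection(grid, f, initial=[0], n_max=4, tol=0.01)
```

Note that the response function must already be known at every candidate point, and that the model is rebuilt at every pass through the loop.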

We have left the definition of “some subset of the worst predicted points” purposefully vague as there are a number of strategies one can adopt in defining this subset. The best possible training set will result when the worst predicted point is added to the training set at each iteration. Following the formalism of Rennen, we call this strategy SS1. However, building up a training set one point at a time, whilst having to recompute a kriging model at each iteration, requires a great deal of time, and is naturally not an ideal strategy.

A second approach would be to add the n worst predicted points to the training set at each iteration, which we denote SSn. SS1 is then the limiting case of the SSn methodology with n = 1. Adopting the SSn strategy with n > 1 is not equivalent to an accelerated version of SS1. In adding the n worst predicted points, where n > 1, from a kriging model, it is likely that a number of these worst predicted points reside in a similar, poorly-modelled region of conformational space. As such, the addition of these points to the training set results in a clustering of points in the new training set, leading to over-training in that region. However, SSn is a far more efficient means for constructing a training set than SS1. Following Rennen, we have chosen to evaluate the performance of SS10, where the 10 worst predicted points are added to the training set at each iteration.

The final approach we propose is an attempt to circumvent the potential for over-training that we have identified as a flaw of the SSn scheme. To do this, we add n points to the training set by uniformly sampling the m worst predicted points, where m > n. This scheme is denoted SSmn. Again, following Rennen, we have chosen the SS1040 methodology, where of the 40 worst predicted points, we select 10 uniformly (i.e. the 40th, 36th, 32nd, ... worst predicted points) and add them to the training set at each iteration. The SSmn methodology works under the assumption that points with similar errors occupy similar regions of conformational space. By uniformly selecting points from the worst points, one prevents the addition of conformationally similar points to the training set.
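The difference between the SSn and SSmn selection rules is easiest to see in code. The sketch below assumes that m is a multiple of n and that at least m candidates are available; the function names are illustrative.

```python
def pick_ss_n(errors, n):
    """SSn: indices of the n worst predicted points, worst first."""
    ranked = sorted(range(len(errors)), key=lambda i: errors[i], reverse=True)
    return ranked[:n]

def pick_ss_mn(errors, n, m):
    """SSmn: n points taken uniformly from the m worst predicted points,
    i.e. the m-th, (m - m/n)-th, ... worst (ranks counted from 1).
    Assumes m is a multiple of n and len(errors) >= m."""
    ranked = sorted(range(len(errors)), key=lambda i: errors[i], reverse=True)[:m]
    step = m // n
    return [ranked[rank - 1] for rank in range(m, 0, -step)][:n]

# Made-up errors: candidate i has error i, so candidate 99 is the worst.
errors = list(range(100))
ss10 = pick_ss_n(errors, 10)        # ranks 1 to 10
ss1040 = pick_ss_mn(errors, 10, 40) # ranks 40, 36, 32, ..., 4
```

With these made-up errors, SS10 takes ten adjacent ranks, whereas the SS1040 rule spreads its ten picks across the forty worst points.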

The sequential selection methodology is highly labour intensive, and so we restrict ourselves to a single test case: the exchange-correlation energy of the carbon atom in methanol. We have found in past work that the exchange-correlation energy appears to be the most difficult to model by kriging, and so seems an appropriate candidate for evaluating the performance of the sequential selection methodologies outlined above. We use the candidate set comprising 3500 models that was used in Section 5.2, and construct random training sets comprising 595 conformations for SS1, and 550 conformations for both SS10 and SS1040. Each training set is then built up to 600 conformations. The performance of the resulting models is evaluated on the rest of the candidate set not included in the training set at each iteration, and the appropriate points are added to the training set from the candidate set at each iteration. Additionally, we have constructed an external testing set of 200 conformations to compare the performance of each sequential selection algorithm.

Figure 5.5 gives the cumulative distribution functions for the errors associated with each of the sequential selection methodologies along with a random benchmark. It is immediately evident that both the SS10 and SS1040 models outperform the random benchmark quite significantly considering they differ by only 50 training points. Surprisingly, the SS1 model performs rather poorly, but we can attribute this to having only added 5 points to the training set, which we assume to have little impact on the quality of the training set. For the SS1 model to perform better, it would be preferable to begin from an initial training set size of 550 and iterate the algorithm 50 times to generate a training set with 600 points. However, as we have previously mentioned, this is a particularly laborious task.

[Figure 5.5 plot: percentile (0 to 100) against prediction error (kJ/mol, axis ticks at 0.01, 0.1 and 1) for the Random, SS1, SS10 and SS1040 models.]

Figure 5.5: Cumulative distribution functions of errors associated with the sequential selection kriging models. SS1 (green), SS10 (blue) and SS1040 (yellow) are compared to a randomly constructed training set benchmark (red). Whilst the SS1 model performs similarly to the random benchmark, the SS10 and SS1040 models both considerably outperform the random benchmark.

We can also plot the average error, maximum error and average standard deviations for each method over the course of iteratively building the training sets. The tabulated data for these quantities are presented in Tables 5.6, 5.7 and 5.8, respectively.

Average Prediction Error (kJ mol−1)

Iteration Number    Random    SS1      SS10     SS1040
1                   .         0.811    0.774    0.790
2                   .         0.856    0.856    0.638
3                   0.798     0.852    0.643    0.596
4                   .         0.646    0.638    0.779
5                   .         0.854    0.653    0.630

Table 5.6: Average errors associated with the SS1, SS10 and SS1040 kriging models relative to the random benchmark.

Maximum Prediction Error (kJ mol−1)

Iteration Number    Random    SS1      SS10     SS1040
1                   .         3.317    3.620    2.964
2                   .         4.548    3.344    2.288
3                   3.418     4.230    2.423    2.082
4                   .         2.191    2.281    2.988
5                   .         3.488    2.649    2.376

Table 5.7: Maximum errors associated with the SS1, SS10 and SS1040 kriging models relative to the random benchmark.

Standard Deviation of Errors (kJ mol−1)

Iteration Number    Random    SS1      SS10     SS1040
1                   .         1.037    1.018    1.016
2                   .         1.113    1.119    0.812
3                   1.034     1.105    0.797    0.763
4                   .         0.815    0.801    1.010
5                   .         1.093    0.826    0.794

Table 5.8: Standard deviation of errors associated with the SS1, SS10 and SS1040 kriging models relative to the random benchmark.

The data in the tables agree well with what we observe in the cumulative distribution of errors in Figure 5.5. Each metric confirms that the SS10 and SS1040 algorithms outperform the random benchmark by a great deal. It is a little difficult to quantify whether SS10 outperforms SS1040, or vice versa, as the data presented appear to fluctuate, with one outperforming the other at one iteration, and then reversing for the next. However, a maximum error almost 1 kJ mol−1 and an average error just over 0.2 kJ mol−1 lower than the random benchmark is vastly impressive given so few iterations of the sequential selection algorithm.

As mentioned at the beginning of this section, the sequential selection strategies are not ideal for our purposes owing to the need for a large volume of ab initio calculations from which the resultant training set is drawn. It is for this reason that we do not pursue the sequential selection strategies in the following work. The results presented in this section confirm that the sequential selection strategies certainly outperform the random benchmark, but they are not feasible strategies owing to their wastefulness of computationally intensive data, and the requirement that kriging models be built iteratively. However, evaluating the performance of the sequential selection methodologies is of use, since they can be used as benchmark subset construction calculations in future work.

5.4 Iterative Voronoi Subset Selection

Both the greedy heuristics and the sequential selection algorithms require a candidate set of data points from which to generate a subset. The sequential selection algorithms are particularly expensive as the explicit calculation of the response function is required for each data point in the candidate set. Calculations of the response function are so time consuming that they are not feasible for large systems or large candidate sets.

A second unrelated issue is that the construction of a subset from a discrete candidate set is not optimal. The subset that one is able to construct is entirely dependent upon the quality of the points within the candidate set. For example, if each point in the candidate set is clustered, then we have no means to prevent our subset from being clustered. A preferable scheme would allow for the selection of datapoints from the continuous space that we are attempting to model, without the restrictions imposed by selection from a discrete set of points.

Formally speaking, given some pre-defined domain, we wish to add points to a subset from within the domain, such that the added points are as mutually dissimilar as possible. An immediate issue is in the definition of the domain within which we sample. To sample within a domain, one requires boundary conditions, allowing for the elucidation of whether a point lies within the domain or outside of it. We avoid discussion of these boundary conditions momentarily. This is a topic that merits an entirely separate discussion, which we undertake in Section 5.5. Rather, we proceed under the assumption that we have been given some initial subset S := {x1, ... , xs}, from which we are able to determine the boundary conditions of the subset, i.e. we have knowledge of the polytope on R^d bounding S.

Our formal problem is actually a variant of the Largest Empty Sphere Problem,

Definition (Largest Empty Sphere). For a set of points, S := {x1, ... , xs | x ∈ R^d}, and the associated bounding polytope, D, find the largest hypersphere that contains no elements of S, whose centre is in the polytope. Or,

\max_{x \in D} \, \min \left\{ \| x - x_i \| \;\middle|\; i = 1, \ldots, s \right\} . \qquad (5.4.1)

A qualitative enunciation of the above definition will prove instructive. We are required to find the radial distance between a query point, x, and the closest point in the subset, xi. We then find the maximum such distance over every conceivable query point in the domain[320]. Whilst a conceptually simple problem, its solution is NP-hard, and brute force approaches (which scale as O(|S|^4)) are conventionally adopted. However, for high-dimensional spaces, such systematic means for solution are not tractable. Note also that the number of possible query points within the domain is uncountably infinite, and so some scheme is required to discretise the space.

We here find it convenient to introduce the concept of a Voronoi tessellation:

Definition (Voronoi Tessellation). Given a set of points, S := {x1, ... , xs | x ∈ R^d}, and a distance metric, d(xi, xj), the Voronoi cell of xk ∈ S, Vk, is defined as

V_k := \left\{ x \in \mathbb{R}^d \;\middle|\; d(x, x_k) \le d(x, x_m) \;\; \forall x_m \in S \setminus x_k \right\} . \qquad (5.4.2)

The Voronoi tessellation possesses a relatively straightforward interpretation; given a set of points, the Voronoi cell belonging to a query point is equal to the region of space that is closer to (or equidistant from) the query point than any other point in the set. The equality condition is of particular importance, since it implies the existence of points that belong to more than one Voronoi cell. When a point belongs to two Voronoi cells, it defines a partitioning between the Voronoi cells. When a point belongs to three or more Voronoi cells, it defines a vertex of the Voronoi tessellation. These features are demonstrated in Figure 5.6.

Figure 5.6: Voronoi tessellation of 12 points in R^2, denoted by black points. The Voronoi cell corresponding to each point is coloured. Voronoi cells are partitioned from one another by lines. Where three or more Voronoi cells meet, a vertex is formed. Image generated by the interactive Voronoi diagram generator, located at http://www.alexbeutel.com/webgl/voronoi.html.

The notion of the Voronoi tessellation is immediately generalisable to a space of any dimensionality. In R^3, for example, the boundary between two Voronoi cells is a plane, the boundary between three Voronoi cells is a line, and a vertex is defined by the boundary between four or more Voronoi cells. A vertex of a Voronoi tessellation in R^d is defined by the boundary between d + 1 or more Voronoi cells.
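Definition (5.4.2) translates directly into a membership test that works in any number of dimensions. The sketch below assumes the Euclidean metric for d and uses a small numerical tolerance for the equality condition; the function name is illustrative.

```python
import math

def voronoi_cells(query, sites, tol=1e-12):
    """Indices k of every Voronoi cell V_k that contains the query point.
    One index: cell interior; two: a boundary; d + 1 or more: a vertex."""
    dists = [math.dist(query, s) for s in sites]
    nearest = min(dists)
    return [k for k, dist in enumerate(dists) if dist <= nearest + tol]

# Three sites in the plane.
sites = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0)]
interior = voronoi_cells((0.5, 0.5), sites)  # closest to site 0 only
boundary = voronoi_cells((1.0, 0.0), sites)  # equidistant from sites 0 and 1
vertex = voronoi_cells((1.0, 1.0), sites)    # equidistant from all three
```

The third query point is shared by d + 1 = 3 cells in the plane, which is exactly the condition for a vertex given above.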

The solution to the largest empty sphere problem and the Voronoi tessellation on S are related. It can be shown that the centre of the largest empty sphere in S coincides with the vertex of the Voronoi tessellation of S that is furthest from all the elements of S. This link allows us to recast (5.4.1) as a maximisation over the discrete vertices of a Voronoi tessellation as opposed to a maximisation over the continuous domain enclosed by the bounding polytope[321]. Solution to the largest empty sphere problem by the Voronoi tessellation scales as O(|S| log |S|), which is certainly an appealing prospect. However, neither the memory requirements nor the scaling with dimensionality of the space in which S is defined are well-documented.

Given this link between the Voronoi tessellation and the largest empty sphere, we propose a novel subset construction algorithm, which we enumerate in Algorithm 5, and refer to as the Iterative Voronoi Subset Selection (IVSS) algorithm.

1. Take S = {x1, ... , xs}
while |S| < N do
    Construct the Voronoi tessellation of S
    Collect the vertices of the Voronoi tessellation, V := {v1, ... , vn}
    Evaluate max_{v∈V} min{||v − xi||}, where xi ∈ S
    Add the corresponding v to S
end
Algorithm 5: Iterative Voronoi Subset Selection Algorithm
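For illustration, Algorithm 5 can be sketched in the plane without any external library. This is not the qVoronoi-based implementation: Voronoi vertices are found by brute force as circumcentres of point triples whose circumcircle is empty, and candidate vertices are restricted to the convex hull of the initial subset, which is assumed here to play the role of the bounding polytope. All function names are illustrative, and the initial subset is assumed to contain at least three non-collinear points.

```python
import math
from itertools import combinations

def cross(o, a, b):
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Andrew's monotone chain; hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    def half(seq):
        chain = []
        for q in seq:
            while len(chain) >= 2 and cross(chain[-2], chain[-1], q) <= 0:
                chain.pop()
            chain.append(q)
        return chain[:-1]
    return half(pts) + half(reversed(pts))

def inside(hull, q, tol=1e-12):
    """True if q lies inside or on the counter-clockwise convex polygon."""
    n = len(hull)
    return all(cross(hull[i], hull[(i + 1) % n], q) >= -tol for i in range(n))

def circumcentre(a, b, c):
    d = 2.0 * (a[0] * (b[1] - c[1]) + b[0] * (c[1] - a[1]) + c[0] * (a[1] - b[1]))
    if abs(d) < 1e-12:
        return None  # collinear triple: no circumcircle
    a2, b2, c2 = (q[0] ** 2 + q[1] ** 2 for q in (a, b, c))
    ux = (a2 * (b[1] - c[1]) + b2 * (c[1] - a[1]) + c2 * (a[1] - b[1])) / d
    uy = (a2 * (c[0] - b[0]) + b2 * (a[0] - c[0]) + c2 * (b[0] - a[0])) / d
    return (ux, uy)

def voronoi_vertices(sites):
    """Circumcentres whose circumcircle contains no site strictly inside."""
    vertices = []
    for a, b, c in combinations(sites, 3):
        centre = circumcentre(a, b, c)
        if centre is None:
            continue
        radius = math.dist(centre, a)
        if all(math.dist(centre, s) >= radius - 1e-9 for s in sites):
            vertices.append(centre)
    return vertices

def ivss(initial, n_target):
    """Add, at each step, the in-domain Voronoi vertex furthest from the subset."""
    hull = convex_hull(initial)
    subset = list(initial)
    while len(subset) < n_target:
        candidates = [v for v in voronoi_vertices(subset) if inside(hull, v)]
        if not candidates:
            break
        subset.append(max(candidates,
                          key=lambda v: min(math.dist(v, s) for s in subset)))
    return subset

square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
grown = ivss(square, 6)
```

Starting from the corners of the unit square, the first point added is the centre of the square, followed by the midpoint of one of its edges. The brute-force vertex search is only practical for small planar examples.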

We have used water to establish a proof-of-concept for the IVSS. An initial training set was randomly generated, consisting of 10 points. We have then used the freely available qVoronoi[322] code to Voronoi tessellate the training set. qVoronoi operates by first computing the Delaunay triangulation of the points provided, and subsequently obtains the Voronoi tessellation as the dual space to the Delaunay triangulation[323]. The training set for water was iteratively grown until the training set size had reached 100. Comparison was then conducted against a randomly constructed training set also containing 100 points. An external testing set comprising 300 points was used for validation. We present the cumulative distribution function for the errors in Figure 5.7.

[Figure 5.7 plot: percentile (0 to 100) against prediction error (kJ/mol, axis ticks at 0.001, 0.01, 0.1, 1 and 10) for the Random and IVSS training sets.]

Figure 5.7: Cumulative distribution function of errors for a training set constructed by the Iterative Voronoi Subset Selection (blue) and a randomly constructed training set (red).

From Figure 5.7, it is evident that the Iterative Voronoi scheme significantly outperforms the randomly constructed training set. The mean error associated with the random training set is 2.64 kJ mol−1, whereas that associated with the IVSS training set is 2.09 kJ mol−1. The maximum error associated with the random training set is 17.76 kJ mol−1, whereas that associated with the IVSS training set is 12.83 kJ mol−1. We note that these values are not quite as good as those obtained in Section 5.2.4, but we can attribute this to a suboptimal selection of points included in the initial subset. Since the IVSS can only add points from within the domain bounded by the initial subset, an informed choice of these points is required. As such, we restrict our critique to direct comparison with the random subset alone.

We have encountered a number of problems in moving to higher dimensional cases. The qVoronoi algorithm stores each boundary surface in main memory, and computes the vertices of the Voronoi tessellation as the intersection of d such hyperplanes. For methanol, where d = 12, we have found that the memory required to store this amount of data exceeds the amount of memory available on the HPC facilities that are available to us at the present time. Indeed, one can appreciate that when the Voronoi tessellation is taking place in a high dimensional space, with a large number of Voronoi cells, the procedure is simply not tractable.

As such, while we have provided a proof-of-concept, illustrating that the IVSS scheme significantly outperforms a random benchmark, it is unfortunately not a viable methodology given the high dimensionality of the spaces that we regularly work in. It is possible that the dimensionality of the problem could be reduced by a judicious choice of dominant dimensions for the problem, but this process has been a contentious issue within our group, and is the subject of ongoing research.

5.5 Domain of Applicability

5.5.1 Introduction

A kriging model is valid only on the domain over which it has been trained. Since the kriging model has no information concerning the topology of the response function outside of the training domain, it cannot make informed predictions of the response function outside of the training domain. This is not to say that kriging is not predictive in these regions. Analysis of Equation (2.4.11) shows that as a query point gets further from all training points, the second term of Equation (2.4.11) goes to zero. The prediction of the response function at a query point is then equal to the mean of the response function as computed from the optimal hyperparameters, {θ̂, p̂}, Equation (2.4.4). Therefore, a kriging model will predict a point far outside of the training domain to have a response function value equal to the mean of the response function over the training domain. However, kriging is not meant to be extrapolative, and so if the response function has a complex topology in untrained regions, the mean is not a particularly useful prediction.
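The limiting behaviour can be written out explicitly. The block below assumes that Equation (2.4.11) takes the standard ordinary kriging form, with the correlation function described in Section 5.2; the symbols r, R, y and 1 (the correlation vector, correlation matrix, training responses and a vector of ones) are restated here as assumptions and may differ from the notation of Chapter 2.

```latex
% Assumed standard form of the kriging predictor; the notation of
% Equation (2.4.11) may differ.
\hat{f}(\mathbf{x}^{*}) = \hat{\mu}
  + \mathbf{r}^{\mathsf{T}} \mathbf{R}^{-1} \left( \mathbf{y} - \mathbf{1}\hat{\mu} \right),
\qquad
r_i = \exp\left( -\sum_{k=1}^{d} \hat{\theta}_k
      \left| x^{*}_k - x_{i,k} \right|^{\hat{p}_k} \right)

% If the separation from every training point grows along some degree of
% freedom k with \hat{\theta}_k > 0:
\left| x^{*}_k - x_{i,k} \right| \to \infty \;\; \forall i
\quad \Longrightarrow \quad
r_i \to 0 \;\; \forall i
\quad \Longrightarrow \quad
\hat{f}(\mathbf{x}^{*}) \to \hat{\mu}
```

Since R and y depend only on the training set, the second term vanishes with r, leaving the mean as the prediction.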

Our aim is to ensure that the predictions made by a kriging model are consistently accurate. We can go some way to achieving this aim by specifying the domain over which the kriging model has been trained, and is therefore valid. We term this domain the Domain of Applicability[324], or DoA. To characterise a domain, we require some means to define its boundary surface. We can subsequently elucidate, before making a prediction, whether the kriging model is a valid description of the response function at a query point. If the query point does not lie in the DoA, we can either add it to the training set and retrain the kriging model, or flag the query point as being inappropriately modelled by the kriging model.

A number of strategies exist for defining the DoA of a set of data, which we briefly review.

Range-Based[325, 326]

One geometric way of defining the boundary surface would be to consider the conformational space of a molecule in R^d. Along each degree of freedom, we can determine

x_k^{\max} = \max_{x_k \in S} x_k , \qquad x_k^{\min} = \min_{x_k \in S} x_k , \qquad (5.5.1)

the maximum and minimum values that the training set spans. By placing a (d−1)-dimensional hyperplane orthogonal to each axis xk at xk = xk^max and xk = xk^min, we can construct a DoA as everything enclosed by these hyperplanes. So, in R^3, a series of two-dimensional planes define the boundaries of the DoA, which is a cuboid. However, it becomes immediately apparent that such a scheme is bound to fail. Imagine four points make up a training set in R^3, and form the vertices of a tetrahedron. Defining the DoA as a cuboid is inappropriate since a tetrahedron occupies at most one third of the volume of its bounding cuboid, and so at least two thirds of the space contained within the DoA is not represented within the training set. Those untrained regions lying within the DoA are referred to as “empty regions”[327].
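The range-based construction, and the way in which it admits empty regions, can be sketched as follows. The function names and the particular tetrahedron are illustrative assumptions.

```python
def bounding_ranges(training):
    """Per-coordinate (min, max) spanned by the training set, Equation (5.5.1)."""
    return [(min(column), max(column)) for column in zip(*training)]

def in_range_doa(query, ranges):
    """True if the query lies between the bounding hyperplanes on every axis."""
    return all(low <= q <= high for q, (low, high) in zip(query, ranges))

# Four training points forming a tetrahedron at one corner of the unit cube;
# its interior is the region x + y + z <= 1 with non-negative coordinates.
tetrahedron = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
ranges = bounding_ranges(tetrahedron)

far_corner = (0.9, 0.9, 0.9)           # inside the box, far from every training point
accepted = in_range_doa(far_corner, ranges)
outside_box = in_range_doa((1.1, 0.0, 0.0), ranges)
```

The query point near the opposite corner of the cube is accepted by the range-based DoA even though it lies well outside the tetrahedron, which for this particular choice of vertices fills only one sixth of the box.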

Distance-Based[328, 329]

Distance-Based methods can be used to evaluate the DoA without explicitly computing the boundary surface of the DoA. The Mahalanobis distance[330] is a measure of the distance between a point (the query point) and a distribution (the training set). Given a test set, one can evaluate the distance between each query point within the test set and the training set, and determine points within the DoA to be those which are less than a threshold Mahalanobis distance from the training set. The existence of a threshold distance is unwelcome as it renders the methodology dependent on some user-defined parameter, which is difficult to characterise.

A variant, and slightly more robust form, of the conventional Distance-Based method is the k-nearest neighbours approach, where the distances between a query point and the k nearest training points are found. If all of these distances fall within some threshold distance, then one can assume the query point to be well-predicted[331]. However, the k-nearest neighbours approach still relies upon a user-defined parameter, and so is not ideal.
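A minimal sketch of the k-nearest neighbours test is given below, assuming Euclidean distances. The function name and the numerical values of k and the threshold are illustrative, and highlight the dependence on user-defined parameters noted above.

```python
import math

def in_knn_doa(query, training, k, threshold):
    """True if the k nearest training points all lie within the threshold
    distance of the query, i.e. if the k-th nearest one does."""
    nearest = sorted(math.dist(query, t) for t in training)[:k]
    return len(nearest) == k and nearest[-1] <= threshold

training = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
well_supported = in_knn_doa((0.2, 0.2), training, k=3, threshold=1.0)
isolated = in_knn_doa((4.0, 4.0), training, k=3, threshold=1.0)
```

The second query point has one training point nearby but fails the test, because its second and third nearest neighbours are distant.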

Probability Density Distribution[332]

The final methodology we wish to mention is the Probability Density Distribution Method, where for every test point, the density of training points in the corresponding region is evaluated. If the density is lower than some threshold value, then the test point is considered to lie outside of the DoA. Probability Density Distributions are able to identify empty regions of the training space, making them rigorous means for defining the DoA. However, it should be noted that Probability Density Distributions are particularly difficult to implement in high-dimensional spaces[333], and the parameterisation of sufficient sampling density is, like the Mahalanobis distance, undesirable.

Alternative methodologies for the evaluation of a DoA do exist, such as Decision Trees[334] and Stepwise Approaches[335], but have been less well-reported in the literature. The attentive reader will notice that each of the references that has been cited in this section comes from the field of Quantitative Structure-Activity Relationship (QSAR). Without wishing to delve into the complexities of this field, suffice it to say that our problems are somewhat different from the problems tackled in QSAR. The biggest difference is in the dimensionality of the space in which we train our kriging models. Values taken by data points are also frequently discrete. The concept of a DoA for our particular problem is not well-characterised, and so we attempt to pursue an alternative methodology to define the DoA.

5.5.2 The Convex Hull

An appealing method for defining the boundary surface of the DoA is the convex hull of the training set[336]. The convex hull, $C_S$, of a set of points, $S$, is defined as the set of all convex combinations of the constituent points of $S$, where:

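Definition (Convex Combination). For a set of points, $S := \{x_1, \ldots, x_n \mid x_i \in \mathbb{R}^d\}$, their convex combination is defined as

$$k_1 x_1 + k_2 x_2 + \ldots + k_n x_n \quad \text{s.t.} \quad \begin{cases} \sum_{i=1}^{n} k_i = 1 \\ 0 \leq k_i \leq 1 \;\; \forall i = 1, \ldots, n \end{cases}$$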

The set of all convex combinations (i.e. each possible set of $\{k_i\}$ subject to the constraints) of $S$ defines a minimal domain that contains all elements of $S$. The convex hull for $S$ is then given by

 ( )  Pn [ X  i=1 ki = 1 CS := kixi s.t. . (5.5.2) xi∈S  0 ≤ ki ≤ 1 ∀i = 1, ... , n

The convex hull defines a d-polytope surrounding all points in a set on $\mathbb{R}^d$. To construct a d-polytope in $\mathbb{R}^d$, one requires at least d + 1 points. So, on $\mathbb{R}$, two points, {A, B}, are required to construct a 1-polytope, i.e. a line segment. The convex hull of {A, B} consists of all points between A and B, each of which can be written as a linear combination of {A, B}, subject to the constraints of a convex combination. On $\mathbb{R}^2$, three points, {A, B, C}, are required to construct a 2-polytope, i.e. a triangle. The convex hull of {A, B, C} consists of all points within the triangle defining the convex hull, each of which can be written as a linear combination of {A, B, C}, subject to the constraints of a convex combination. We can generalise this result to as many dimensions as required.

If some query point, $y$, lies within the convex hull, $C_S$, it can be written as a convex combination of all the points contained within $S$ which satisfies the constraints placed on the $\{k_i\}$. Or, algebraically,

$$\begin{pmatrix} x_1 & \cdots & x_n \\ 1 & \cdots & 1 \end{pmatrix} k = \begin{pmatrix} y \\ 1 \end{pmatrix} , \tag{5.5.3}$$

where we have introduced the vector $k = (k_1 \ldots k_n)^\top$. Note that the final equation ensures the normalisation of $k$. (5.5.3) corresponds to a set of d + 1 linear equations in n unknowns. For the typical case where $n \gg d + 1$, the system of equations is underdetermined, and a solution vector $k$ cannot be uniquely defined.

One way to circumvent this system of underdetermined equations would be to explicitly compute $C_S$. However, with conventional software, this is not possible since the best methods available scale as $O(n^{d/2})$, which is problematic for the high-dimensional spaces we consider with a great many training points. Indeed, we have found that the popular software package, qHull[322], is not capable of computing the convex hull of a system in $\mathbb{R}^{12}$ and above. We can, however, make explicit use of Carathéodory’s Theorem to form a system of d + 1 equations in d + 1 unknowns. Carathéodory’s Theorem states:

Theorem (Carathéodory’s Theorem). Given a point $y \in \mathbb{R}^d$ that lies within $C_S$, then,

$$\exists\, S' := \{x_1, \ldots, x_{d+1}\} \subseteq S , \text{ such that } y \text{ lies within } C_{S'} .$$

In other words, we are able to define a subset, $S' \subseteq S$, of cardinality d + 1, such that $y$ lies within the convex hull of $S'$. We can then revise (5.5.3) to

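$$\begin{pmatrix} x_1 & \cdots & x_{d+1} \\ 1 & \cdots & 1 \end{pmatrix} k = \begin{pmatrix} y \\ 1 \end{pmatrix} , \tag{5.5.4}$$

with $k = (k_1 \ldots k_{d+1})^\top$. (5.5.4) is now a series of d + 1 equations in d + 1 unknowns, and has a unique solution provided the d + 1 points are affinely independent. If $y$ lies within $C_{S'}$, then the solution vector $k$ satisfies the normality constraint, with every individual $k_i$ satisfying $0 \leq k_i \leq 1$.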

The issue with the above methodology is that if $y$ lies within $C_S$, then we cannot arbitrarily select $S' := \{x_1, \ldots, x_{d+1}\} \subseteq S$ and expect $y$ to lie within $C_{S'}$, since $C_S$ is defined as $\bigcup_{S' \subseteq S} C_{S'}$. Therefore, to rigorously specify whether $y$ lies within $C_S$, we must systematically assess whether it lies in any of the $\binom{n}{d+1}$ possible convex hulls. Combinatorially scaling methods are certainly not desirable where $n \gg d + 1$, and so we address this issue in Section 5.5.4.
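The square system (5.5.4) and the exhaustive search over subsets can be illustrated with a short sketch. It assumes NumPy; the function names, the numerical tolerance and the four-point example are hypothetical and serve only to show why a single arbitrary subset is insufficient.

```python
from itertools import combinations
from math import comb

import numpy as np

def in_simplex(vertices, y, tol=1e-12):
    """Solve (5.5.4) for k and check 0 <= k_i <= 1 for the d + 1 rows of `vertices`."""
    A = np.vstack([vertices.T, np.ones(len(vertices))])   # (d + 1) x (d + 1)
    b = np.append(y, 1.0)                                  # final row enforces sum(k) = 1
    try:
        k = np.linalg.solve(A, b)
    except np.linalg.LinAlgError:
        return False   # affinely dependent subset: no unique solution
    return bool(np.all(k >= -tol) and np.all(k <= 1.0 + tol))

def in_hull_brute_force(points, y):
    """Test y against every (d + 1)-point subset of `points`."""
    points = np.asarray(points, dtype=float)
    y = np.asarray(y, dtype=float)
    d = points.shape[1]
    return any(in_simplex(points[list(idx)], y)
               for idx in combinations(range(len(points)), d + 1))

square = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
query = (0.9, 0.8)

# The query lies in the hull of the square, but not in the first triangle tried.
first_three_only = in_simplex(np.array(square[:3]), np.array(query))
any_subset = in_hull_brute_force(square, query)

# Number of subsets for 1000 candidate points in R^3.
n_subsets = comb(1000, 4)
```

Even for 1000 points in three dimensions the subset count is of the order of $10^{10}$, which is why the exhaustive route is not pursued.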

Figure 5.8: The Ackley function in two dimensions. The function is projected down onto the xy-plane as a colour-contour plot.

5.5.3 Test Case

To evaluate whether the convex hull is an appropriate boundary surface for the DoA, we must test whether points predicted outside of the convex hull defined by the training set have higher errors than those predicted from within the convex hull. A convenient low-dimensional test case is the Ackley function,

$$f(x) = -a \exp\left( -b \sqrt{ \frac{1}{d} \sum_{i=1}^{d} x_i^2 } \right) - \exp\left( \frac{1}{d} \sum_{i=1}^{d} \cos(c x_i) \right) + a + \exp(1) , \tag{5.5.5}$$

where recommended values for the parameters are a = 20, b = 0.2, c = 2π. The Ackley function is a standard problem for global optimisation algorithms, as the number of local minima renders optimisation methods based on spatial derivatives inefficient. The Ackley function is shown in Figure 5.8.
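For reference, (5.5.5) translates directly into a few lines of NumPy. This is a sketch with the recommended parameter values as defaults; the function name is an arbitrary choice.

```python
import numpy as np

def ackley(x, a=20.0, b=0.2, c=2.0 * np.pi):
    """Ackley function (5.5.5) evaluated at a single point x in R^d."""
    x = np.asarray(x, dtype=float)
    d = x.size
    root_mean_square = np.sqrt(np.sum(x ** 2) / d)
    mean_cosine = np.sum(np.cos(c * x)) / d
    return -a * np.exp(-b * root_mean_square) - np.exp(mean_cosine) + a + np.e

value_at_origin = ackley([0.0, 0.0])    # global minimum of the function
value_at_corner = ackley([20.0, 20.0])  # corner of the [0, 20] x [0, 20] domain
```

The global minimum at the origin evaluates to zero up to floating-point rounding, which is a convenient sanity check for any implementation.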

We use the Ackley function on $\mathbb{R}^2$ as a test case on the domain $x_1, x_2 \in [0, 20]$. Our convex hull will be the triangle with vertices (0, 0), (20, 0) and (10, 20). 1000 random points were chosen on $x_1, x_2 \in [0, 20]$, and each was classified as lying either within or outside of the convex hull. From those points within the convex hull, we have constructed a training set of 50 points, including the vertices of the convex hull. Then, 400 points from inside of the convex hull and 400 points from outside of the convex hull were predicted and compared to the analytical results of the Ackley function at those points. The average errors and mean squared errors for the two groups of test points are given in Figure 5.9.

[Figure 5.9: bar charts of Prediction Error (left panel) and Mean Squared Error (right panel), comparing In Hull and Out of Hull points.]

Figure 5.9: Absolute error (left) and MSE (right) of predictions made from within the convex hull (yellow) and outside of the convex hull (blue) of the training set.

The average error and average MSE of predictions of points outside the convex hull are more than three and eight times larger, respectively, than their counterparts that lie within the convex hull. This result indicates that the convex hull is a natural means for partitioning an unbounded space into a DoA and untrained region. Naturally though, we need to apply this methodology to relevant systems that we wish to krige. We undertake this in the next section.

5.5.4 Largest d-Polytope

As we have seen, invoking Carathéodory’s Theorem allows us to define a convex hull, $C_{S'}$, on $\mathbb{R}^d$ by use of d + 1 points. However, it is the union of all such convex hulls of d + 1 points from a training set of n points that gives the convex hull of all n points. Computation of the convex hull in this fashion is not feasible owing to the associated combinatorial scaling of the problem.

We have formulated two distinct strategies for circumventing the combinatorial scaling. The first is to use some small number of convex hulls formed by d + 1 points, and take their union as representative of the convex hull of all n points. The second is to find the d + 1 points within $S$ that form the largest possible convex hull from all $\binom{n}{d+1}$ possible combinations.

We have chosen to adopt the second strategy. If we find it to be inadequate, then finding the largest possible convex hull formed by d + 1 points is still of use, since this convex hull can then be used in the first strategy in combination with a number of other convex hulls. Any approximation strategy will perceive some points within the true convex hull to lie outside of the approximate convex hull, but our aim is to construct a “guiding” DoA.

To find the d + 1 points that form the largest possible $C_{S'}$ unfortunately also requires an explicit computation of $C_S$, a task that we are attempting to circumvent. As such, we introduce a further approximation, where we construct $C_{S'}$ from the largest d-polytope (LdP) on $S$. To find the LdP, we invoke a greedy heuristic where the selection function is one which maximises the distance from a point to the polytope that can be constructed from the points currently within $S'$, as we now illustrate.

Given two points in $S'$, we can construct a 1-polytope, $F(x_1, x_2) \equiv F_1$, and evaluate the distance between $F_1$ and each point in $S \setminus S'$, i.e. $d(x_k, F_1)\ \forall k \in S \setminus S'$. The point that is furthest from $F_1$ is then added to $S'$, and a 2-polytope constructed, $F(x_1, x_2, x_3) \equiv F_2$. This process is iterated until $|S'| = d + 1$. Generalising, for the ith iteration of the greedy algorithm, an i-polytope, $F_i$, is constructed from the points in $S'$ and $d(x_k, F_i)\ \forall k \in S \setminus S'$ is evaluated. The most distant point is then added to $S'$ and the process is iterated.

We give an explicit low-dimensional example of the above process to demonstrate its working. We take a set of points in $\mathbb{R}^3$, meaning that we require $|S'| = d + 1 = 4$. The first two points in $S'$ are selected by choosing the two most distant points in $S$. The next point is found by evaluating the distance between all points in $S \setminus S'$ and the line, $F_1$, formed by the two points in $S'$, the most distant of which is added to $S'$. Now there are three points in $S'$, meaning that $F_2$ is a plane, and so we evaluate the distance between all points in $S \setminus S'$ and the plane $F_2$. The most distant point is added to $S'$, and we finish with the four required points to form $C_{S'}$.

We see that the process can be extended indefinitely so long as we can generalise the notion of the distance between a polytope of arbitrary dimension and a point. We can gain some insight into how to perform such a task by an analysis of water, a low-dimensional case (d = 3). To construct the LdP, we require the distance from a point to a line segment (d = 1), and the distance from a point to a plane (d = 2). Both of these problems have geometrical solutions, where the normal vector to the d-polytope is computed, and the vector from a point on the d-polytope to some point is projected onto the normal vector. The distance from the point to the d-polytope is then obtained.

When attempting to validate the above methodology on a water model, we found that given a candidate set of 1000 points, the LdP contained roughly 30 of those points. Such a low proportion of points lying within the LdP implies that the LdP is a poor approximation to $C_S$. We have chosen to revise the notion of a convex combination of the points forming the LdP. Recall that a convex combination of points is defined as

$$k_1 x_1 + k_2 x_2 + \ldots + k_n x_n \quad \text{s.t.} \quad \begin{cases} \sum_{i=1}^{n} k_i = 1 \\ 0 \leq k_i \leq 1 \;\; \forall i = 1, \ldots, n \end{cases}$$

A number of alternative combinations do exist. For example, the conical combination of points is defined as

$$k_1 x_1 + k_2 x_2 + \ldots + k_n x_n \quad \text{s.t.} \quad \begin{cases} \sum_{i=1}^{n} k_i \in \mathbb{R}_+ \\ k_i \geq 0 \;\; \forall i = 1, \ldots, n \end{cases}$$

whilst the affine combination retains the normalisation $\sum_i k_i = 1$ but places no bounds on the individual $k_i$. The convex combinations of a set of points are therefore a subset of both the conical and the affine combinations. The more strict the restrictions on the $k_i$, the less space covered by the respective combination of points. By varying the restrictions we place on the $k_i$, we are able to increase the space covered by the combination of points we choose to define the LdP. In the absence of a formal name for this methodology, we simply refer to it as the restricted affine combination of points, which in turn define a restricted affine hull, $A_S$, of a set of points, $S$. When dealing with $A_{S'}$, we will use the acronym RAH-LdP, the restricted affine hull of the LdP.

To minimise the number of free parameters that we are required to use for the restricted affine combination of points, we have chosen the constraint $-\Delta \leq k_i \leq 1 + \Delta \;\; \forall i = 1, \ldots, n$, as well as the normality condition on the $k_i$. Thus, depending on the magnitude of ∆, $A_{S'}$ is the union of $C_{S'}$ and a “halo region” that extends the domain of $C_{S'}$. Intuitively, this methodology is valid for our purposes, since the points closer to the LdP have a higher probability of lying within the true convex hull of the training data. If we are scrupulous about our choice of ∆, we should obtain reasonable approximations to the true convex hull of the training data.
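The restricted affine constraint lends itself to a compact membership test: solve the square system (5.5.4) for $k$, then ask how far the coefficients stray outside [0, 1]. The sketch below assumes NumPy and d + 1 affinely independent vertices; the function names and the triangle example are hypothetical.

```python
import numpy as np

def affine_coefficients(vertices, y):
    """Solve the (d + 1) x (d + 1) system (5.5.4) for the coefficient vector k."""
    vertices = np.asarray(vertices, dtype=float)
    A = np.vstack([vertices.T, np.ones(len(vertices))])
    b = np.append(np.asarray(y, dtype=float), 1.0)   # final row: normality condition
    return np.linalg.solve(A, b)

def minimal_delta(vertices, y):
    """Smallest delta >= 0 for which -delta <= k_i <= 1 + delta holds for all i."""
    k = affine_coefficients(vertices, y)
    return float(max(0.0, np.max(k - 1.0), np.max(-k)))

def in_restricted_affine_hull(vertices, y, delta):
    """True if y lies in the convex hull of `vertices` or its halo of width delta."""
    return minimal_delta(vertices, y) <= delta

triangle = [(0.0, 0.0), (20.0, 0.0), (10.0, 20.0)]
delta_inside = minimal_delta(triangle, (10.0, 5.0))    # inside the triangle
delta_outside = minimal_delta(triangle, (1.0, 15.0))   # outside, in the halo region
```

A point inside the LdP has a minimal ∆ of zero, while points in the halo region are admitted only once ∆ exceeds their minimal value.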

For our preliminary work, we have chosen ∆ such that a certain percentage of points within the candidate set lies within the RAH-LdP (the “acceptance percentage”). A suitable metric to evaluate the performance of the LdP is an “average error ratio”: the ratio of the average error from points within the LdP to the average error from points outside of the LdP. So, if the RAH-LdP is able to distinguish between well-predicted and poorly-predicted points, the average error ratio should be small. For water, the progression of both the average error ratio and ∆ with respect to the acceptance percentage are plotted in Figure 5.10.
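One way to realise the acceptance percentage and the average error ratio is sketched below. It assumes that, for each candidate point, the smallest ∆ at which that point is accepted has already been computed; the use of a percentile to pick ∆, the function names and the numerical values are illustrative assumptions and not data from this work.

```python
import numpy as np

def delta_for_acceptance(min_deltas, acceptance_percentage):
    """Delta at which the requested percentage of candidates is accepted."""
    return float(np.percentile(min_deltas, acceptance_percentage))

def average_error_ratio(min_deltas, errors, delta):
    """Mean error inside the RAH-LdP divided by the mean error outside it."""
    min_deltas = np.asarray(min_deltas, dtype=float)
    errors = np.asarray(errors, dtype=float)
    inside = min_deltas <= delta
    if inside.all() or not inside.any():
        raise ValueError("need points both inside and outside the hull")
    return float(errors[inside].mean() / errors[~inside].mean())

# Made-up values for six candidate points (not results from the thesis).
min_deltas = np.array([0.0, 0.1, 0.2, 0.4, 0.8, 1.6])
errors = np.array([1.0, 1.0, 2.0, 4.0, 6.0, 8.0])

delta = delta_for_acceptance(min_deltas, 50.0)
ratio = average_error_ratio(min_deltas, errors, delta)
```

A ratio well below one indicates that the hull separates well-predicted from poorly-predicted points; a ratio near one indicates that it does not.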

We see from Figure 5.10 that ∆ is linear in the acceptance percentage. When the acceptance percentage is low, we find that the average error of points within the RAH-LdP is roughly four times lower than the average error of points outside of the hull. As we increase past an acceptance percentage of 50%, the average error of points inside of the RAH-LdP is roughly six times lower than the average error of points outside of the RAH-LdP. Therefore, for the low-dimensional case, we find that the RAH-LdP is a successful approximation to the DoA.

Unfortunately, generalising the notion of the distance between a point and a d-polytope is not trivial when d > 2. We have attempted to scale-up the idea of projecting the position vector of a point onto the normal vector to the d-polytope. Computing the normal vector to polytopes where d > 2 is not immediately obvious, e.g. the normal vector to a tetrahedron. However, we can exploit the fact that the scalar product of the normal vector and a vector lying in the d-polytope is zero. In 3-dimensional systems, the cross product can be used to this effect, but unfortunately such an approach only ensures orthogonality in $\mathbb{R}^3$ and $\mathbb{R}^7$[337]. As such, we are forced to employ an approximate technique to computing the normal vector in spaces of arbitrary dimensions. We have attempted a Gram-Schmidt orthogonalisation procedure to this end, the algorithm of which has been outlined in Section 3.4.1. The algorithmic workflow that we have proposed for our greedy heuristic is shown in Algorithm 6.

[Figure 5.10: plot of Average Error Ratio (left axis) and Delta (right axis) against Acceptance Percentage.]

Figure 5.10: Progression of average error ratio (left ordinate) and ∆ (right ordinate) with increasing acceptance percentage for the water monomer ($\mathbb{R}^3$).

However, the Gram-Schmidt orthogonalisation was unsuccessful at producing a usable RAH-LdP in $\mathbb{R}^4$ and above. The resultant RAH-LdP contained a very small number of points, and was simply not viable as an approximation to the convex hull.

1. Take $S' = \emptyset$
2. Find $\max_{x_i, x_j \in S} d(x_i, x_j)$ and add $x_i, x_j$ to $S'$
while $|S'| < d + 1$ do
    $i = |S'| - 1$
    Form the i-polytope, $F_i$, from the points in $S'$
    Find the normal vector to $F_i$, $n_i$
    for $x_k \in S \setminus S'$ do
        Compute vector, $v_k$, from point on $F_i$ to $x_k$
        $d(x_k, F_i) = |v_k \cdot n_i|$
    end
    Add $x_k$ to $S'$, where $x_k$ attains $\max_{k \in S \setminus S'} d(x_k, F_i)$
end

Algorithm 6: The LdP Algorithm
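The greedy loop of Algorithm 6 can be sketched as follows. Note the substitution: rather than an explicit normal vector, the distance is taken as the residual of a least-squares projection onto the unbounded flat through the current vertices, which is well defined in any dimension. This is not the Gram-Schmidt procedure attempted in this work, and the function names and the small point cloud are hypothetical.

```python
import numpy as np

def distance_to_flat(points_on_flat, x):
    """Distance from x to the affine flat through the rows of `points_on_flat`."""
    origin = points_on_flat[0]
    directions = (points_on_flat[1:] - origin).T      # d x m matrix of edge vectors
    v = x - origin
    if directions.shape[1] == 0:
        return float(np.linalg.norm(v))
    # Least-squares projection of v onto the span of the edge vectors.
    coeffs, *_ = np.linalg.lstsq(directions, v, rcond=None)
    return float(np.linalg.norm(v - directions @ coeffs))

def greedy_ldp(points):
    """Indices of d + 1 points chosen by the farthest-from-flat greedy rule."""
    points = np.asarray(points, dtype=float)
    n, d = points.shape
    # Seed with the two most distant points (O(n^2) pairwise search).
    gaps = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    i, j = np.unravel_index(np.argmax(gaps), gaps.shape)
    chosen = [int(i), int(j)]
    while len(chosen) < d + 1:
        flat = points[chosen]
        dists = np.array([distance_to_flat(flat, x) for x in points])
        dists[chosen] = -1.0                          # never reselect a vertex
        chosen.append(int(np.argmax(dists)))
    return chosen

cloud = np.array([[0, 0, 0], [4, 0, 0], [0, 3, 0], [0, 0, 2],
                  [1, 1, 0.5], [0.5, 0.2, 0.1]], dtype=float)
vertices = greedy_ldp(cloud)
```

For this cloud the heuristic selects the four extreme points and leaves the two interior points unselected.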

Instead, we have attempted to define the LdP by use of a physically intuitive scheme. Recall that we require d + 1 points to define the LdP. The first d points can be trivially obtained by projecting the position vector of each point onto a single basis vector at a time. The maximum such value is then used as a vertex for the LdP along the corresponding basis vector. A case in $\mathbb{R}^3$ is depicted in Figure 5.11, and the generalisation to higher dimensions is straightforward.

In Figure 5.11, the red points correspond to the candidate set. The three basis vectors of the space are $\hat{x}_1, \hat{x}_2, \hat{x}_3$. We find that the point denoted by position vector $a$, when projected onto $\hat{x}_1$, is the maximum such value out of all the candidate set. We therefore use the blue point, $\mathrm{proj}_{\hat{x}_1} a$, as a vertex of the LdP. Equivalently, we can define $\mathrm{proj}_{\hat{x}_2} b$ and $\mathrm{proj}_{\hat{x}_3} c$ as vertices of the LdP, arising from the projections of $b$ and $c$ onto $\hat{x}_2$ and $\hat{x}_3$, respectively. The blue face then corresponds to a 2-polytope formed from the three vertices. The final point required can then be selected by use of the greedy MaxMin heuristic across the entire candidate set. For the sake of argument, assume the point chosen by the MaxMin greedy heuristic resides close to the origin of the coordinate system. Then, the LdP is a tetrahedron of relatively large size, and it can be appreciated that a great proportion of the candidate set lies within the LdP.
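The axis-projection construction can be written in a few lines. In this sketch the MaxMin step is taken to mean "choose the candidate whose smallest distance to the vertices already selected is largest"; that reading, the function names and the four candidate points are assumptions made for illustration.

```python
import numpy as np

def axis_vertices(points):
    """One vertex per basis vector: the largest projection onto that axis."""
    points = np.asarray(points, dtype=float)
    return np.diag(points.max(axis=0))

def maxmin_point(points, vertices):
    """Candidate whose smallest distance to the existing vertices is largest."""
    points = np.asarray(points, dtype=float)
    gaps = np.linalg.norm(points[:, None, :] - vertices[None, :, :], axis=-1)
    return points[np.argmax(gaps.min(axis=1))]

candidates = np.array([[3.0, 0.5, 0.2],
                       [0.4, 2.0, 0.1],
                       [0.3, 0.6, 5.0],
                       [0.1, 0.1, 0.1]])

first_d = axis_vertices(candidates)        # d vertices lying on the axes
last = maxmin_point(candidates, first_d)   # final vertex from the candidate set
ldp_vertices = np.vstack([first_d, last])  # d + 1 vertices of the LdP
```

In this example the MaxMin step picks the candidate nearest the origin, giving the relatively large tetrahedron described above.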

Figure 5.11: Definition of three vertices used in the construction of $A_{S'}$ in $\mathbb{R}^3$.

It is worth noting that in Figure 5.11, we see that a number of points lie outside of the prospective LdP. While this demonstrates that the LdP is a non-ideal representation of the convex hull, it is remedied somewhat by use of the restricted affine hull, and the “halo region” associated with the RAH-LdP that it grants. Of course, a LdP that is closer to the actual convex hull of the points is preferable, but the halo region compensates for the discrepancy to some extent.

In Figures 5.12, 5.13 and 5.14, we give the progression of ∆ and the average error ratio with respect to the acceptance percentage for methanol ($\mathbb{R}^{12}$), NMA ($\mathbb{R}^{30}$) and the water decamer ($\mathbb{R}^{84}$), respectively. Candidate sets of 600 points have been used for each example.

[Figure 5.12: plot of Average Error Ratio (left axis) and Delta (right axis) against Acceptance Percentage.]

Figure 5.12: Progression of average error ratio (left ordinate) and ∆ (right ordinate) with increasing acceptance percentage for methanol ($\mathbb{R}^{12}$).

For each of the graphs, we find that ∆ is once again essentially linear in the acceptance percentage, which is a useful trait since its value can be predicted to give a desired acceptance percentage. For methanol, we find that the RAH-LdP is a relatively successful means for approximating the DoA of a kriging model. Being able to successfully approximate the convex hull in $\mathbb{R}^{12}$ is particularly noteworthy, since the convex hull cannot be explicitly computed in $\mathbb{R}^{12}$ by use of qHull. However, it is worth mentioning that the difference in average error between those points outside and inside of the RAH-LdP is nowhere near as impressive as that obtained for water.

[Figure 5.13: plot of Average Error Ratio (left axis) and Delta (right axis) against Acceptance Percentage.]

Figure 5.13: Progression of average error ratio (left ordinate) and ∆ (right ordinate) with increasing acceptance percentage for NMA ($\mathbb{R}^{30}$).

[Figure 5.14: plot of Average Error Ratio (left axis) and Delta (right axis) against Acceptance Percentage.]

Figure 5.14: Progression of average error ratio (left ordinate) and ∆ (right ordinate) with increasing acceptance percentage for the water decamer ($\mathbb{R}^{84}$).

We see that as the dimensionality of the space increases, the RAH-LdP is a successively poorer approximation to the DoA. In $\mathbb{R}^{30}$, we find that the average error ratio is only less than one when the acceptance percentage is above 80%, while in $\mathbb{R}^{84}$, the average error ratio is roughly unity across all acceptance percentages. These results suggest that the RAH-LdP is not an appropriate approximation to the DoA in high-dimensional spaces.

We suggest a few potential remedies to the shortcomings of the RAH-LdP as an approximation to the DoA. Of course, a more thorough description of the LdP will offer a better approximation to the convex hull of the candidate set. Since construction of the LdP is a geometric problem, a number of strategies can be employed, and are commonly analytical in nature, making their computational implementation trivial. Secondly, and perhaps most importantly, we have not decoupled the quality of the training set from the complete process of defining the DoA. If the candidate set yields a poor kriging model, then the kriging model technically has no DoA, i.e. it is not applicable on any domain of space. As such, any further work on the DoA would require a decoupling of the kriging model quality from the RAH-LdP quality.

Regarding the implementation of the DoA within the kriging methodology, we suggest it as a useful validation tool for a kriging model, in a similar manner to that proposed by Li and coworkers[17]. A kriging model can be used over the course of an MD simulation, and those points within and outside of the DoA logged. If a great many points lie outside of the DoA, it suggests that the kriging model does not properly account for the physically relevant regions of conformational space, and additional training points are required. Alternatively, if only a small number of conformations lie outside of the DoA, the kriging model is essentially validated as spanning the relevant regions of conformational space. In the latter case, the kriging model is suitable for release to end-product users, i.e. those who anticipate using the kriging models as “black boxes”.

5.6 Conclusion

We have presented a number of methodological developments for the selection of an optimal training set for kriging. The greedy subset selection algorithms have been reformulated to use a novel selection function. This novel selection function results in improved accuracy of the kriging model, relative to one constructed by a random subset selection. However, work is still required to optimise this methodology and yield “ideal” training sets.

A novel subset selection algorithm, the Iterative Voronoi algorithm, has been presented. However, this methodology is only viable for low-dimensional cases, and so cannot be implemented for our purposes. In spite of this, if it were possible to reduce the dimensionality of the systems in some way, it may be feasible to implement the Iterative Voronoi algorithm in the future and establish whether it is a successful subset selection methodology for larger systems.

Finally, we have presented a novel means for defining the Domain of Applicability of a machine learning model. By approximating the “restricted affine hull” from a set of points by use of a greedy algorithm to construct the largest d-polytope on the points, it has been shown that one can approximately determine whether a point lies within the training set of a machine learning model in low-dimensional spaces. Further work is, however, required to make the methodology more successful in higher dimensional spaces. The approximation of the DoA of a kriging model is of utility in our work since it allows one to ascertain whether a kriging model is predictive in those relevant regions of conformational space.

Chapter 6

Raman Optical Activity

Raman optical activity (ROA) is a powerful chiroptical technique that can be used to probe a number of molecular properties. The polarisation characteristics of Rayleigh and Raman scattered photons were investigated by Atkins and Barron in 1969, leading to the conclusion that the scattered light should carry a small degree of circular polarisation depending upon the chirality of the scattering body[338]. Later work[339] by the same group remedied a number of simplifications that had been employed in the previous work, and also introduced an experimentally detectable quantity,

$$\Delta = \frac{I_R - I_L}{I_R + I_L} , \tag{6.0.1}$$

where ∆ is referred to as the circular intensity difference (CID), while $I_R$ and $I_L$ correspond to the intensities of the right- and left-scattered light, respectively.

In the early days of its inception, ROA was used to deduce stereochemical information of molecular species by comparison with band patterns of related molecules[340]. Unlike a number of other spectroscopies, ROA intensities in the low wavenumber region of the spectrum are rich in information. These regions correspond to, for example, torsional degrees of freedom[341] and large scale conformational modes of motion[342]. ROA also possesses the capacity to determine the proportions of enantiomers within a sample, which is of great utility in purification processes[343].

The structural features of biochemical systems make them particularly amenable to ROA spectroscopy. ROA offers a far more robust analysis of biomolecules than more common forms of vibrational spectroscopy because of its sensitivity to chirality. The ROA signal of a system is largely dominated by the more rigid and chiral elements of a system, corresponding to secondary structure motifs in polypeptides. This makes ROA useful for structural characterisation of complex biomolecules, in contrast to Raman spectroscopy, where amino acid sidechain signals typically dominate the spectrum. ROA is by no means constrained to peptide systems, and the ROA signals of other biochemical species, such as DNA bases[344] and oligomeric nucleic acids[345], also prove informative. The utility of ROA in structural studies has been exemplified over the years by a number of studies on the folding[346, 347], assembly[348] and dynamical properties[349] of biochemical systems.

A Note on Optical Activity

For the sake of brevity, we do not concern ourselves with the mathematical complexities of optical activity. However, the results of optical activity experiments can be quite easily correlated with molecular structure without recourse to this formalism. We introduce the parity operator, $\Pi : x \to -x$, where we use $x$ as a generic coordinate vector. We are interested in the operation of the parity operator on the electric field vector, $E$, and molecular conformation vector, $x$. Without too much trouble, we can verify that both $E$ and $x$ change sign under $\Pi$, and by definition are referred to as “polar vectors” (in contrast to the notion of an “axial vector”).

The central concept that we employ is that parity is conserved for the vast majority of physical systems (the β-decay of $^{60}$Co is a notable exception[350]). Elaborating upon this, suppose we subject an experiment to the parity operator; both experiments, before and after the parity operator has operated, should yield physically realisable experimental results.

Let us take the experimental setup for an optically active molecule which is irradiated by a linearly polarised light beam¹. The interaction of the molecule with the light beam results in a rotation of the light to, for argument’s sake, the right. One can subsequently detect the handedness of the light by means of some experimental apparatus. Under $\Pi$, the molecule is replaced by the corresponding enantiomer, and the scattered (a technical term that we define in the next section) light changes its handedness, in this case to the left. These results are experimentally verifiable, in that enantiomers rotate light in opposite directions[351]. Given that parity is conserved in this process, we can also conclude that enantiomers rotating light the same way is not a physically realisable situation.

6.1 General Theory

6.1.1 Classical Raman Scattering

Qualitatively, the theory of molecular light scattering dictates that a photon of a given frequency (and hence energy) is absorbed by the molecular system. This absorption event promotes the system into a vibrationally excited state, as exhibited by a change in the dipole moment of the system. After a short time, the photon is re-emitted, or scattered, and the molecular system returns to its “ground state”. However, since the vibrational energy levels of the system are quantised, only light of certain frequencies can result in a scattering event. We see, then, that if incident light results in some change in the molecular dipole moment, the light must have promoted the system to a vibrationally excited state, and will subsequently be scattered. This explanation is the (classical) basis for vibrational spectroscopy, whereby one can observe the vibrational energy levels of a molecular system by detecting these scattering events².

¹ Since linearly polarised light is composed of a superposition of left- and right-circularly polarised light, the parity operator has no effect on the linearly polarised light beam.

Consider a periodic electric field as a function of time, $E(t) = E^0 \cos(\omega_L t)$, where $E^0$ is the amplitude of the field and $\omega_L$ its angular frequency. Interaction of this field with a molecule comprising a set of point charges induces a time-dependent electric dipole moment, $\mu(t)$,

$$\mu_\alpha(t) = \alpha_{\alpha\beta} E_\beta(t) = \alpha_{\alpha\beta} E^0_\beta \cos(\omega_L t) , \tag{6.1.1}$$

where we have invoked the Einstein summation convention, i.e. repeated indices are summed over. $\alpha_{\alpha\beta}$ denotes an element of the anisotropic electric dipole polarisability tensor³.

We have discussed at length in Section 3 how the normal modes of motion form a convenient basis within which to represent molecular motion. In this spirit, we consider a minimum energy molecular conformation to be denoted by the (3N − 6)-dimensional vector, $Q_0$. We have also shown that infinitesimal displacements can be approximated harmonically, so that $dQ_k(t) = Q^0_k \cos(\omega_k t)$, where $\omega_k$ is the angular frequency of the kth normal mode.

² If the incident light promotes the system into an electronically excited state, then the associated technique is termed absorption spectroscopy. Promotion of a system to an electronically excited state typically requires a great deal more energy than promotion to a vibrationally excited state, and so higher energy light is required for these spectroscopies.

³ In simpler theories, the electric dipole polarisability is assumed to be isotropic, and so is a scalar.

Without wishing to be mathematically rigorous, we propose that the electric dipole polarisability is conformationally dependent. Recall that the electric dipole moment corresponds to the displacement of electron density from a spherical configuration. If the electric dipole polarisability were not conformationally dependent, the electron density would be deformed linearly in the amount of energy supplied. Intuitively, it seems reasonable that the amount of energy required to deform the electron density should increase as the electron density is further distorted.

Keeping this argument in mind, we express the electric dipole polarisability tensor as a Taylor series in Q,

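\[
\begin{aligned}
\alpha_{\alpha\beta} &= \alpha^0_{\alpha\beta} + \left(\frac{\partial \alpha_{\alpha\beta}}{\partial Q_k}\right)_{\mathbf{Q}_0} dQ_k + \frac{1}{2}\left(\frac{\partial^2 \alpha_{\alpha\beta}}{\partial Q_k\, \partial Q_l}\right)_{\mathbf{Q}_0} dQ_k\, dQ_l + \dots \\
&\approx \alpha^0_{\alpha\beta} + \left(\frac{\partial \alpha_{\alpha\beta}}{\partial Q_k}\right)_{\mathbf{Q}_0} Q^0_k \cos(\omega_k t),
\end{aligned} \tag{6.1.2}
\]
where we have truncated the Taylor series at first order, neglecting all quadratic and higher terms, which we reiterate is only valid for infinitesimal displacements.

We have also used the notation $\alpha^0_{\alpha\beta}$ to denote the electric dipole polarisability tensor evaluated at $\mathbf{Q}_0$. Using this quantity in (6.1.1), we obtain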

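\[
\mu_\alpha(t) = \alpha_{\alpha\beta} E_\beta \approx \alpha^0_{\alpha\beta} E_\beta + \left(\frac{\partial \alpha_{\alpha\beta}}{\partial Q_k}\right)_{\mathbf{Q}_0} Q^0_k \cos(\omega_k t)\, E_\beta. \tag{6.1.3}
\]
Introducing the harmonic form of the electric field,
\[
\mu_\alpha(t) \approx \alpha^0_{\alpha\beta} E^0_\beta \cos(\omega_L t) + \left(\frac{\partial \alpha_{\alpha\beta}}{\partial Q_k}\right)_{\mathbf{Q}_0} Q^0_k \cos(\omega_k t)\, E^0_\beta \cos(\omega_L t). \tag{6.1.4}
\]
By invoking the prosthaphaeresis formula for the product of cosines, $\cos a \cos b = \tfrac{1}{2}\left[\cos(a+b) + \cos(a-b)\right]$, we obtain
\[
\mu_\alpha(t) \approx \alpha^0_{\alpha\beta} E^0_\beta \cos(\omega_L t) + \frac{1}{2}\left(\frac{\partial \alpha_{\alpha\beta}}{\partial Q_k}\right)_{\mathbf{Q}_0} Q^0_k E^0_\beta \Big[\cos\big((\omega_L + \omega_k)t\big) + \cos\big((\omega_L - \omega_k)t\big)\Big]. \tag{6.1.5}
\]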

We recall that if an event is associated with a change in the dipole moment, it signifies a promotion of the system to a vibrationally excited state. From the above equation, we see there are three distinct components that contribute to the change in dipole moment. The first term on the right hand side corresponds to the scattering of light with angular frequency ωL, i.e. the frequency of the scattered light is equal to the frequency of the incident light. This scattering event is referred to as Rayleigh scattering, and corresponds to elastic scattering.

The other two terms correspond to scattering events involving light of frequency $\omega_L + \omega_k$ and $\omega_L - \omega_k$. These are referred to as inelastic, or Raman, scattering events.

For the case of $\omega_L + \omega_k$, the system begins in a vibrationally excited state, and upon the scattering event returns to the ground state, scattering light of higher energy than the incident light. We refer to this as “Anti-Stokes Raman scattering”.

Conversely, for $\omega_L - \omega_k$, the system begins in the ground state and is left in a vibrationally excited state after scattering, so that the scattered light is of lower energy than the incident light. This is referred to as “Stokes Raman scattering”.

This Raman scattered light is experimentally detectable, and forms the basis of a spectroscopy that is of great use to the chemist. The Stokes scattered light is, under standard conditions (room temperature), orders of magnitude more prominent than the anti-Stokes scattered light. As such, it is the Stokes scattered light that is collected in the formation of a Raman spectrum, although in principle the anti-Stokes scattered light could be used to the same effect.
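The decomposition in (6.1.5) can be checked numerically. The following is a minimal sketch, not part of the original derivation: it evaluates the induced dipole for a single mode both as the product of cosines in (6.1.4) and as the sum of Rayleigh, anti-Stokes and Stokes terms in (6.1.5). The function names and the parameter names (`alpha0`, `dalpha_q0`, `e0`) are illustrative assumptions. The last function estimates the relative thermal population of the first vibrationally excited state with a Boltzmann factor; treating this as the reason for the weakness of anti-Stokes scattering is a standard argument that we assume here, and frequency-dependent prefactors are ignored.

```python
import math

# CODATA constants; the speed of light is in cm/s so that wavenumbers in cm^-1 can be used.
H_PLANCK = 6.62607015e-34    # J s
C_LIGHT_CM = 2.99792458e10   # cm / s
K_BOLTZMANN = 1.380649e-23   # J / K


def dipole_product_form(t, alpha0, dalpha_q0, e0, omega_l, omega_k):
    """Induced dipole written as in (6.1.4): Rayleigh term plus a product of cosines.

    alpha0    : polarisability at the reference geometry (illustrative scalar)
    dalpha_q0 : polarisability derivative times the mode amplitude (illustrative scalar)
    """
    field = e0 * math.cos(omega_l * t)
    return alpha0 * field + dalpha_q0 * math.cos(omega_k * t) * field


def dipole_sideband_form(t, alpha0, dalpha_q0, e0, omega_l, omega_k):
    """The same quantity written as in (6.1.5): Rayleigh, anti-Stokes and Stokes terms."""
    rayleigh = alpha0 * e0 * math.cos(omega_l * t)
    anti_stokes = 0.5 * dalpha_q0 * e0 * math.cos((omega_l + omega_k) * t)
    stokes = 0.5 * dalpha_q0 * e0 * math.cos((omega_l - omega_k) * t)
    return rayleigh + anti_stokes + stokes


def excited_state_population_factor(wavenumber_cm, temperature_k):
    """Boltzmann factor exp(-h c nu / k T) for a mode of the given wavenumber (cm^-1)."""
    energy = H_PLANCK * C_LIGHT_CM * wavenumber_cm
    return math.exp(-energy / (K_BOLTZMANN * temperature_k))
```

For a mode near 1000 cm⁻¹ at room temperature the population factor is of the order of 10⁻², which is consistent with the statement that Stokes scattering dominates.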

6.1.2 The Electromagnetic Field

A charge density, $\rho(\mathbf{x})$, and its corresponding current density, $\mathbf{J} = \rho\mathbf{v}$, where $\mathbf{v}$ is the velocity of the charge distribution, give rise to an electric and magnetic field, $\mathbf{E}$ and $\mathbf{B}$ respectively, in free space. Modified fields, $\mathbf{D}$ and $\mathbf{H}$, can be related to their free space counterparts by the relations
\[
\mathbf{D}(\mathbf{x}, t) = \epsilon\epsilon_0 \mathbf{E}(\mathbf{x}, t), \qquad \mathbf{H}(\mathbf{x}, t) = \frac{1}{\mu\mu_0}\mathbf{B}(\mathbf{x}, t). \tag{6.1.6}
\]

The constants $\epsilon_0$ and $\mu_0$ are the dielectric constant (permittivity) and magnetic permeability of free space, respectively, whilst $\epsilon$ and $\mu$ are the corresponding relative values in an isotropic medium. These constants become functions, $\epsilon(\mathbf{x})$ and $\mu(\mathbf{x})$, if the medium is inhomogeneous.

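The four Maxwell equations in an infinite homogeneous medium have the form
\begin{align}
\nabla \cdot \mathbf{D} = \rho \quad &\implies \quad \nabla \cdot \mathbf{E} = \frac{\rho}{\epsilon\epsilon_0}, \tag{6.1.7} \\
\nabla \cdot \mathbf{B} = 0 \quad &\implies \quad \nabla \cdot \mathbf{H} = 0, \tag{6.1.8} \\
\nabla \times \mathbf{E} = -\frac{\partial \mathbf{B}}{\partial t} \quad &\implies \quad \nabla \times \mathbf{D} = -\epsilon\epsilon_0\mu\mu_0 \frac{\partial \mathbf{H}}{\partial t}, \tag{6.1.9} \\
\nabla \times \mathbf{H} = \mathbf{J} + \frac{\partial \mathbf{D}}{\partial t} \quad &\implies \quad \nabla \times \mathbf{B} = \mu\mu_0\left(\mathbf{J} + \epsilon\epsilon_0 \frac{\partial \mathbf{E}}{\partial t}\right), \tag{6.1.10}
\end{align}
each of which is of sufficient importance to merit a name: (6.1.7) is Gauss’ Theorem; (6.1.8) is Gauss’ Law for Magnetostatics; (6.1.9) is Faraday’s Law of Magnetic Induction; and (6.1.10) is Ampere’s Law.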

Invoking the vector identity for an arbitrary vector function, $\mathbf{Z}$,
\[
\nabla \times (\nabla \times \mathbf{Z}) = \nabla(\nabla \cdot \mathbf{Z}) - \nabla^2 \mathbf{Z},
\]
we can establish the wave equations for the electric and magnetic fields⁴,
\[
\nabla^2 \mathbf{E} = \epsilon\epsilon_0\mu\mu_0 \frac{\partial^2 \mathbf{E}}{\partial t^2}, \tag{6.1.11}
\]
\[
\nabla^2 \mathbf{B} = \epsilon\epsilon_0\mu\mu_0 \frac{\partial^2 \mathbf{B}}{\partial t^2}. \tag{6.1.12}
\]
Comparison with the standard wave equation, with which the reader is assumed to be sufficiently familiar that we can omit reproducing it here, leads to an expression for the velocity of the wave, $v = 1/\sqrt{\epsilon\epsilon_0\mu\mu_0}$, which equates to the speed of light in a given isotropic medium.

Solutions to the wave equation are the familiar plane waves,
\[
\mathbf{E}(\mathbf{r}, t) = \mathbf{E}^0 e^{i(\mathbf{k}\cdot\mathbf{r} - \omega t)}, \tag{6.1.13}
\]
where $\mathbf{E}^0$ denotes the amplitude of the wave, $\omega$ is the angular frequency and $\mathbf{k}$ is termed the wavevector, and is related to the wavelength, $\lambda$, of light by the expression $|\mathbf{k}| = 2\pi/\lambda$. The wavevector is directed orthogonally to the wave fronts of the wave (surfaces of constant phase).

The four Maxwell equations can be written as two equations involving the more fundamental scalar and vector potentials, $\phi(\mathbf{r}, t)$ and $\mathbf{A}(\mathbf{r}, t)$, respectively. If the divergence of a vector function is equal to zero, as in (6.1.8), the underlying vector function can be written as the curl of the vector potential, i.e.
\[
\mathbf{B} = \nabla \times \mathbf{A}. \tag{6.1.14}
\]
Then, (6.1.9) becomes
\[
\nabla \times \mathbf{E} = -\nabla \times \frac{\partial \mathbf{A}}{\partial t} \quad \therefore \quad \nabla \times \left(\mathbf{E} + \frac{\partial \mathbf{A}}{\partial t}\right) = 0,
\]
which implies that $\mathbf{E}$ can be written as
\[
\mathbf{E} = -\nabla\phi(\mathbf{r}, t) - \frac{\partial \mathbf{A}}{\partial t}, \tag{6.1.15}
\]
where we have utilised the fact that the curl of the gradient field, $\nabla\phi(\mathbf{r}, t)$, is equal to zero. It should then be noted that the choice of the scalar potential, $\phi(\mathbf{r}, t)$, is arbitrary at this stage, since it does not contribute to the curl of $\mathbf{E}$.

⁴We derive the electric field wave equation from the Maxwell equations, but the magnetic field wave equation follows in a similar manner. From the vector identity, $\nabla \times (\nabla \times \mathbf{D}) = \nabla(\nabla \cdot \mathbf{D}) - \nabla^2 \mathbf{D}$. By (6.1.7) and (6.1.9), $-\nabla \times \left(\epsilon\epsilon_0\mu\mu_0\, \partial\mathbf{H}/\partial t\right) = \nabla\rho - \nabla^2\mathbf{D}$. Since the spatial and temporal differential operators commute, we can combine this property with (6.1.10), in the absence of a current density, to give $-\epsilon\epsilon_0\mu\mu_0\, \partial^2\mathbf{D}/\partial t^2 = \nabla\rho - \nabla^2\mathbf{D}$. The first term of the right hand side goes to zero owing to $\nabla\rho = 0$, since the medium is homogeneous. Substitution of $\mathbf{D}$ for the corresponding vacuum quantity, $\mathbf{E}$, leads to $-\epsilon\epsilon_0\mu\mu_0\, \partial^2\mathbf{E}/\partial t^2 = -\nabla^2\mathbf{E}$, and the wave equation follows.

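By use of (6.1.15), we find that (6.1.7) becomes
\[
\begin{aligned}
\nabla \cdot \mathbf{E} = -\nabla \cdot \left(\nabla\phi(\mathbf{r}, t) + \frac{\partial \mathbf{A}}{\partial t}\right) &= \frac{\rho}{\epsilon\epsilon_0} \\
\nabla^2\phi(\mathbf{r}, t) + \frac{\partial}{\partial t}\nabla \cdot \mathbf{A} &= -\frac{\rho}{\epsilon\epsilon_0}.
\end{aligned} \tag{6.1.16}
\]
Similarly, (6.1.10) combined with (6.1.14) yields
\[
\nabla \times (\nabla \times \mathbf{A}) = \mu\mu_0\left(\mathbf{J} + \epsilon\epsilon_0 \frac{\partial \mathbf{E}}{\partial t}\right).
\]
Then, by the vector identity we introduced above and (6.1.15),
\[
\begin{aligned}
\nabla(\nabla \cdot \mathbf{A}) - \nabla^2\mathbf{A} &= \mu\mu_0\left[\mathbf{J} - \epsilon\epsilon_0 \frac{\partial}{\partial t}\left(\nabla\phi(\mathbf{r}, t) + \frac{\partial \mathbf{A}}{\partial t}\right)\right] \\
&= \mu\mu_0\left[\mathbf{J} - \epsilon\epsilon_0\left(\nabla\frac{\partial \phi}{\partial t} + \frac{\partial^2 \mathbf{A}}{\partial t^2}\right)\right] \\
&= \mu\mu_0\mathbf{J} - \mu\mu_0\epsilon\epsilon_0\left(\nabla\frac{\partial \phi}{\partial t} + \frac{\partial^2 \mathbf{A}}{\partial t^2}\right).
\end{aligned}
\]
Finally, by noting that $\mu\mu_0\epsilon\epsilon_0 = v^{-2}$,
\[
\nabla(\nabla \cdot \mathbf{A}) - \nabla^2\mathbf{A} + \frac{1}{v^2}\left(\nabla\frac{\partial \phi}{\partial t} + \frac{\partial^2 \mathbf{A}}{\partial t^2}\right) = \mu\mu_0\mathbf{J}. \tag{6.1.17}
\]
The two equations, (6.1.16) and (6.1.17), completely specify the four Maxwell equations, (6.1.7)-(6.1.10).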

A peculiarity of the vector and scalar potentials means that the transformations

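\[
\mathbf{A} = \mathbf{A}_0 - \nabla\Lambda(\mathbf{r}, t), \tag{6.1.18}
\]
\[
\phi = \phi_0 + \frac{\partial}{\partial t}\Lambda(\mathbf{r}, t), \tag{6.1.19}
\]
where $\Lambda(\mathbf{r}, t)$ is an arbitrary function and $\mathbf{A}_0$, $\phi_0$ correspond to the vector and scalar potentials at arbitrary points in space and time, leave the electric and magnetic field vectors unaltered. By (6.1.14)
\[
\mathbf{B} = \nabla \times \mathbf{A} = \nabla \times (\mathbf{A}_0 - \nabla\Lambda) = \nabla \times \mathbf{A}_0,
\]
where we have already seen that the curl of the gradient of a scalar function disappears. Similarly by (6.1.15)
\[
\begin{aligned}
\mathbf{E} = -\nabla\phi - \frac{\partial \mathbf{A}}{\partial t} &= -\nabla\left(\phi_0 + \frac{\partial}{\partial t}\Lambda(\mathbf{r}, t)\right) - \frac{\partial}{\partial t}\big(\mathbf{A}_0 - \nabla\Lambda(\mathbf{r}, t)\big) \\
&= -\nabla\phi_0 - \frac{\partial}{\partial t}\mathbf{A}_0.
\end{aligned}
\]
This property is termed gauge invariance, and suggests the existence of free variables which can be fixed by placing restrictions on the vector and scalar potentials. One such choice of restrictions is termed the Coulomb gauge, such that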

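\[
\nabla^2\Lambda(\mathbf{r}, t) = \nabla \cdot \mathbf{A}_0,
\]
then by the gauge transformation for the vector potential,
\[
\begin{aligned}
\nabla \cdot \mathbf{A} &= \nabla \cdot \mathbf{A}_0 - \nabla \cdot \nabla\Lambda(\mathbf{r}, t) \\
&= \nabla \cdot \mathbf{A}_0 - \nabla^2\Lambda(\mathbf{r}, t) = \nabla \cdot \mathbf{A}_0 - \nabla \cdot \mathbf{A}_0,
\end{aligned}
\]
leaving the Coulomb gauge condition
\[
\nabla \cdot \mathbf{A} = 0. \tag{6.1.20}
\]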
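Gauge invariance can be illustrated with a simple static example, sketched below under assumptions that are not taken from the text: a uniform magnetic field of strength `B_Z` along $z$, described by the vector potential $\mathbf{A}_0 = \tfrac{1}{2}\mathbf{B} \times \mathbf{r}$, and the illustrative gauge function $\Lambda = B_z x y / 2$. The transformation (6.1.18) then gives $\mathbf{A} = (-B_z y, 0, 0)$. Both potentials should yield the same curl, and since $\nabla^2\Lambda = 0$ both satisfy the Coulomb gauge condition (6.1.20). Derivatives are taken by central finite differences, which are exact up to rounding for these linear fields.

```python
B_Z = 2.0  # uniform field strength along z, arbitrary units (illustrative value)


def a_symmetric(x, y, z):
    """Vector potential A0 = (1/2) B x r for B = (0, 0, B_Z)."""
    return (-0.5 * B_Z * y, 0.5 * B_Z * x, 0.0)


def grad_gauge_function(x, y, z):
    """Gradient of the illustrative gauge function Lambda = B_Z * x * y / 2."""
    return (0.5 * B_Z * y, 0.5 * B_Z * x, 0.0)


def a_transformed(x, y, z):
    """Gauge-transformed potential A = A0 - grad(Lambda), following (6.1.18)."""
    a0 = a_symmetric(x, y, z)
    grad = grad_gauge_function(x, y, z)
    return tuple(a - g for a, g in zip(a0, grad))


def partial(field, component, axis, point, h=1e-3):
    """Central-difference derivative of one component of a vector field along one axis."""
    plus, minus = list(point), list(point)
    plus[axis] += h
    minus[axis] -= h
    return (field(*plus)[component] - field(*minus)[component]) / (2.0 * h)


def curl(field, point):
    """Numerical curl of a vector field at a point."""
    return (
        partial(field, 2, 1, point) - partial(field, 1, 2, point),
        partial(field, 0, 2, point) - partial(field, 2, 0, point),
        partial(field, 1, 0, point) - partial(field, 0, 1, point),
    )


def divergence(field, point):
    """Numerical divergence of a vector field at a point."""
    return sum(partial(field, k, k, point) for k in range(3))
```

Evaluating `curl` for `a_symmetric` and `a_transformed` at any point should return the same field, $(0, 0, B_z)$, while `divergence` should vanish for both.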

6.1.3 Hamiltonian in an Electromagnetic Field

The classical Lagrangian for a set of particles in an electromagnetic field is given by

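\[
L(\mathbf{r}, \dot{\mathbf{r}}, t) = T(\dot{\mathbf{r}}, t) - V(\mathbf{r}, \dot{\mathbf{r}}, t) = \sum_{i=1}^{N}\left[\frac{1}{2}m_i\dot{\mathbf{r}}_i^2 - q_i\Phi(\mathbf{r}_i, t) + q_i\dot{\mathbf{r}}_i \cdot \mathbf{A}(\mathbf{r}_i, t) - V(\mathbf{r}_i)\right], \tag{6.1.21}
\]
where $q_i$ is the charge of the $i$th particle, $T$ and $V$ are the kinetic and potential energies, respectively, $V(\mathbf{r}_i)$ is some external potential and $\mathbf{A}$ and $\Phi$ are the vector and scalar potentials describing the electromagnetic field we have introduced in Section 6.1.2. To obtain the Hamiltonian of the system, we perform the Legendre transformation, which necessitates evaluation of the canonical momentum, $p_\alpha$, conjugate to a coordinate $r_\alpha$. This is readily obtained by differentiating the Lagrangian with respect to the corresponding velocity,
\[
\begin{aligned}
p_\alpha = \frac{\partial L}{\partial \dot{r}_\alpha} &= \frac{1}{2}m_i\frac{\partial}{\partial \dot{r}_\alpha}\dot{r}_\alpha^2 + q_i\frac{\partial}{\partial \dot{r}_\alpha}\dot{r}_\alpha A_\alpha(\mathbf{r}_i, t) \\
&= m_i\dot{r}_\alpha + q_i A_\alpha(\mathbf{r}_i, t),
\end{aligned} \tag{6.1.22}
\]
where the index $\alpha$ denotes a generalised coordinate and the index $i$ denotes the atom to which the generalised coordinate belongs. The Hamiltonian is subsequently defined as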

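\[
\begin{aligned}
H(\mathbf{r}, \mathbf{p}, t) &= \sum_{i=1}^{N}\mathbf{p}_i \cdot \dot{\mathbf{r}}_i - L(\mathbf{r}, \dot{\mathbf{r}}, t) \\
&= \sum_{i=1}^{N}\left[\frac{1}{2m_i}\big(\mathbf{p}_i - q_i\mathbf{A}(\mathbf{r}_i, t)\big)^2 + q_i\Phi(\mathbf{r}_i, t) + V(\mathbf{r}_i)\right].
\end{aligned} \tag{6.1.23}
\]
Expanding the above quantity, we obtain
\[
H(\mathbf{r}, \mathbf{p}, t) = \sum_{i=1}^{N}\left[\frac{1}{2m_i}\Big(p_i^2 - q_i\mathbf{p}_i \cdot \mathbf{A}(\mathbf{r}_i, t) - q_i\mathbf{A}(\mathbf{r}_i, t) \cdot \mathbf{p}_i + q_i^2 A^2(\mathbf{r}_i, t)\Big) + q_i\phi(\mathbf{r}_i, t) + V(\mathbf{r}_i)\right]. \tag{6.1.24}
\]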

Since we are attempting to describe a quantum phenomenon, we employ the correspondence theorem, in which the momenta are replaced by their quantum mechanical operators, $\mathbf{p}_i = -i\hbar\nabla_i$⁵. As such,
\[
H(\mathbf{r}, \mathbf{p}, t) = \sum_{i=1}^{N}\left[-\frac{\hbar^2\nabla_i^2}{2m_i} + \frac{i\hbar q_i}{2m_i}\nabla_i \cdot \mathbf{A}(\mathbf{r}_i, t) + \frac{i\hbar q_i}{2m_i}\mathbf{A}(\mathbf{r}_i, t) \cdot \nabla_i + \frac{q_i^2}{2m_i}A^2(\mathbf{r}_i, t) + q_i\phi(\mathbf{r}_i, t) + V(\mathbf{r}_i)\right]. \tag{6.1.25}
\]

By working in the Coulomb gauge defined in (6.1.20), we can set the term containing $q_i\nabla_i \cdot \mathbf{A}(\mathbf{r}_i, t)$ to zero. We also omit terms proportional to $A^2(\mathbf{r}_i, t)$ and higher since we assume the external field is of low intensity. Rearranging to separate the Hamiltonian into a sum of unperturbed and perturbed Hamiltonians, and using the classical quantities $\mathbf{r}$ and $\mathbf{p}$ once more,
\[
\begin{aligned}
H(\mathbf{r}, \mathbf{p}, t) &= H_0 + H_{int} \\
&= \sum_{i=1}^{N}\left[\frac{1}{2m_i}p_i^2 + V(\mathbf{r}_i)\right] + \sum_{i=1}^{N}\left[q_i\phi(\mathbf{r}_i, t) - \frac{1}{2m_i}q_i\mathbf{A}(\mathbf{r}_i, t) \cdot \mathbf{p}_i\right],
\end{aligned} \tag{6.1.26}
\]
where $H_{int}$ is the first order perturbation to the ground state Hamiltonian, $H_0$.

Our next step involves the expression of the perturbing fields, E(r, t) and B(r, t) as Taylor series about some origin, r0, i.e.,

\[
\begin{aligned}
E_\alpha(\mathbf{r}, t) &= E_\alpha(\mathbf{r}_0, t) + r_\beta\nabla_\beta E_\alpha(\mathbf{r}_0, t) + \dots, \\
B_\alpha(\mathbf{r}, t) &= B_\alpha(\mathbf{r}_0, t) + r_\beta\nabla_\beta B_\alpha(\mathbf{r}_0, t) + \dots,
\end{aligned} \tag{6.1.27}
\]
where we use the Greek indices $\alpha, \beta, \dots$ to denote a Cartesian degree of freedom. We also introduce the Einstein summation convention, where repetition of an index in a term implies summation over all degrees of freedom.

⁵Technically, the positions $\mathbf{r}_i$ should also be exchanged for their quantum mechanical operators, $\hat{\mathbf{r}}_i$. However, this substitution does nothing to change the underlying mathematics, and so is omitted here.

Since (6.1.26) is expressed in terms of the vector and scalar potentials, we require that (6.1.27) also be reformulated in terms of $\phi(\mathbf{r}, t)$ and $\mathbf{A}(\mathbf{r}, t)$. Barron and Gray[352] have found that one can choose the scalar and vector potentials to be of the form
\[
\begin{aligned}
\phi(\mathbf{r}, t) &= \phi(\mathbf{r}_0, t) - r_\alpha E_\alpha(\mathbf{r}_0, t) - \frac{1}{2}r_\alpha r_\beta\nabla_\alpha E_\beta(\mathbf{r}_0, t) + \dots, \\
A_\alpha(\mathbf{r}, t) &= \frac{1}{2}\varepsilon_{\alpha\beta\gamma}B_\beta(\mathbf{r}_0, t)\,r_\gamma + \frac{1}{3}\varepsilon_{\alpha\gamma\delta}\,r_\beta\nabla_\beta B_\gamma(\mathbf{r}_0, t)\,r_\delta + \dots,
\end{aligned} \tag{6.1.28}
\]
where $\varepsilon_{\alpha\beta\gamma}$ is the rank three Levi-Civita symbol, which is equal to $+1$ for an even permutation of the indices, $-1$ for an odd permutation, and $0$ if any index is repeated. For the following, we omit the zeroth order scalar potential, $\phi(\mathbf{r}_0, t)$, as it can be accounted for later as an additive term. Placing these into the expression for $H_{int}$ in (6.1.26), we obtain

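\[
\begin{aligned}
H_{int} &= \sum_{i=1}^{N}q_i\left[-r_{i,\alpha}E_\alpha(\mathbf{r}_0, t) - \frac{1}{2}r_{i,\alpha}r_{i,\beta}\nabla_\alpha E_\beta(\mathbf{r}_0, t) + \dots\right] \\
&\quad - \sum_{i=1}^{N}\frac{q_i p_{i,\alpha}}{2m_i}\left[\frac{1}{2}\varepsilon_{\alpha\beta\gamma}B_\beta(\mathbf{r}_0, t)\,r_{i,\gamma} + \frac{1}{3}\varepsilon_{\alpha\gamma\delta}\,r_{i,\beta}\nabla_\beta B_\gamma(\mathbf{r}_0, t)\,r_{i,\delta} + \dots\right] \\
&= -\mu_\alpha E_\alpha(\mathbf{r}_0, t) - \Xi_{\alpha\beta}\nabla_\alpha E_\beta(\mathbf{r}_0, t) - m_\alpha B_\alpha(\mathbf{r}_0, t) + \dots,
\end{aligned} \tag{6.1.29}
\]
where we have used three multipole moments for shorthand,
\begin{align}
\mu_\alpha &= \sum_{i=1}^{N}q_i r_{i,\alpha}, \tag{6.1.30} \\
\Xi_{\alpha\beta} &= \frac{1}{2}\sum_{i=1}^{N}q_i r_{i,\alpha}r_{i,\beta}, \tag{6.1.31} \\
m_\alpha &= \sum_{i=1}^{N}\frac{q_i}{2m_i}\varepsilon_{\alpha\beta\gamma}\,r_{i,\beta}\,p_{i,\gamma}, \tag{6.1.32}
\end{align}
the electric dipole, electric quadrupole and magnetic dipole moments, respectively. It is worth revising (6.1.31) to account for the fact that the quadrupole moment is traceless,
\[
\Theta_{\alpha\beta} = \frac{1}{2}\sum_{i=1}^{N}q_i\left(3r_{i,\alpha}r_{i,\beta} - r_i^2\delta_{\alpha\beta}\right), \tag{6.1.33}
\]
and (6.1.29) becomes
\[
H_{int} = -\mu_\alpha E_\alpha(\mathbf{r}_0, t) - \frac{1}{3}\Theta_{\alpha\beta}\nabla_\alpha E_\beta(\mathbf{r}_0, t) - m_\alpha B_\alpha(\mathbf{r}_0, t) + \dots, \tag{6.1.34}
\]
where the factor of $\frac{1}{3}$ cancels the factor of 3 in (6.1.33).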
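The electric multipole moments of (6.1.30) and (6.1.33) are straightforward to evaluate for a set of point charges. The sketch below is purely illustrative: the function names and the example charge arrangement (a linear quadrupole) are assumptions made here and do not appear in the text.

```python
def electric_dipole(charges, positions):
    """Electric dipole moment, mu_a = sum_i q_i r_{i,a}, as in (6.1.30)."""
    return [sum(q * r[a] for q, r in zip(charges, positions)) for a in range(3)]


def traceless_quadrupole(charges, positions):
    """Traceless quadrupole, Theta_ab = (1/2) sum_i q_i (3 r_a r_b - r^2 delta_ab), as in (6.1.33)."""
    theta = [[0.0] * 3 for _ in range(3)]
    for q, r in zip(charges, positions):
        r_squared = sum(x * x for x in r)
        for a in range(3):
            for b in range(3):
                delta = 1.0 if a == b else 0.0
                theta[a][b] += 0.5 * q * (3.0 * r[a] * r[b] - r_squared * delta)
    return theta


# Illustrative linear quadrupole: +1 charges at z = +/-1 and a -2 charge at the origin.
example_charges = [1.0, -2.0, 1.0]
example_positions = [(0.0, 0.0, 1.0), (0.0, 0.0, 0.0), (0.0, 0.0, -1.0)]
```

For this arrangement the dipole vanishes by symmetry and the quadrupole is diagonal, with components that sum to zero as required of a traceless tensor.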

6.1.4 Wavefunction Perturbed by Periodic Field

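We write the time-dependent Schrödinger equation
\[
\left(i\hbar\frac{\partial}{\partial t} - H_0\right)|\psi\rangle = H_{int}|\psi\rangle, \tag{6.1.35}
\]
where we have already encountered
\begin{align}
H_0 &= \sum_{i=1}^{N}\left[\frac{1}{2m_i}p_i^2 + V(\mathbf{r}_i)\right], \tag{6.1.36} \\
H_{int} &= -\left[\mu_\alpha E_\alpha(\mathbf{r}_0, t) + \frac{1}{3}\Theta_{\alpha\beta}\nabla_\alpha E_\beta(\mathbf{r}_0, t) + m_\alpha B_\alpha(\mathbf{r}_0, t) + \dots\right]. \tag{6.1.37}
\end{align}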

Herein, we drop the convention of explicitly specifying the arguments of the functions $\mathbf{E}(\mathbf{r}_0, t)$ and $\mathbf{B}(\mathbf{r}_0, t)$, instead using the shorthand $(\mathbf{E})_0$ and $(\mathbf{B})_0$, respectively.

We note that by setting $H_{int} = 0$, we recover the unperturbed Schrödinger equation, the solution of which takes the form
\[
|\psi\rangle = \sum_n c_n|\phi_n\rangle\exp(-i\omega_n t) = \sum_n c_n|\psi_n\rangle, \tag{6.1.38}
\]
where $|\phi_n\rangle$ denotes the $n$th eigenfunction with accompanying eigenvalue of $\hbar\omega_n$, $\omega_n$ being the angular frequency. $c_n$ is an expansion coefficient, where the index $n$ runs over all eigenfunctions of the system. For the remainder of our discussion, we attempt to make our arguments as general as possible, and so assume that the wavefunction and perturbing fields are complex valued. For our purposes, the perturbing electric and magnetic fields take the form of single-frequency harmonic plane waves, with angular frequency $\omega_L$,

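\begin{align}
(E_\beta)_0 &= E^0_\beta\big[\exp(i\boldsymbol{\kappa}\cdot\mathbf{r})\big]_0\exp(-i\omega_L t) = E^0_\beta\exp(-i\omega_L t), \tag{6.1.39} \\
(\bar{E}_\beta)_0 &= \bar{E}^0_\beta\exp(i\omega_L t), \tag{6.1.40} \\
(B_\beta)_0 &= B^0_\beta\exp(-i\omega_L t), \tag{6.1.41} \\
(\bar{B}_\beta)_0 &= \bar{B}^0_\beta\exp(i\omega_L t), \tag{6.1.42}
\end{align}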

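where $E^0_\beta$ and $B^0_\beta$ are the complex $\beta$-components of the electric and magnetic field amplitudes. We have also used the bar notation, $(\bar{E}_\beta)_0$, $(\bar{B}_\beta)_0$, to denote the complex conjugate. Since we deal explicitly with rank-1 tensors, we have no need to adopt the conjugate transpose notation that is frequently encountered in the literature. We shall make a great deal of use of the following identities
\begin{align}
(E_\beta)_0 &= \frac{1}{2}\Big[(E_\beta)_0 + (\bar{E}_\beta)_0\Big] = \frac{1}{2}\Big[E^0_\beta\exp(-i\omega_L t) + \bar{E}^0_\beta\exp(i\omega_L t)\Big], \tag{6.1.43} \\
(\dot{E}_\beta)_0 &= -\frac{i\omega_L}{2}\Big[(E_\beta)_0 - (\bar{E}_\beta)_0\Big] = -\frac{i\omega_L}{2}\Big[E^0_\beta\exp(-i\omega_L t) - \bar{E}^0_\beta\exp(i\omega_L t)\Big], \tag{6.1.44}
\end{align}
where we have used the familiar dot notation to denote a temporal derivative. One can form the wavefunction of a system perturbed by external fields as an expansion of each eigenfunction, $|\psi'_n\rangle$[353, 354],
\[
\begin{aligned}
|\psi'_n\rangle = \Big[&|\psi_n\rangle + |\psi_n^{a\beta}\rangle(E_\beta)_0 + |\psi_n^{b\beta}\rangle(\bar{E}_\beta)_0 + |\psi_n^{c\beta}\rangle(B_\beta)_0 \\
&+ |\psi_n^{d\beta}\rangle(\bar{B}_\beta)_0 + |\psi_n^{e\beta\gamma}\rangle\nabla_\beta(E_\gamma)_0 + |\psi_n^{f\beta\gamma}\rangle\nabla_\beta(\bar{E}_\gamma)_0 + \dots\Big]\exp(-i\omega_n t)
\end{aligned} \tag{6.1.45}
\]
with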

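\[
|\psi_n^{x\beta}\rangle = \sum_{j \neq n}x^{\beta}_{jn}|\psi_j\rangle, \qquad x = a, b, c, d, \tag{6.1.46}
\]
\[
|\psi_n^{x\beta\gamma}\rangle = \sum_{j \neq n}x^{\beta\gamma}_{jn}|\psi_j\rangle, \qquad x = e, f, \tag{6.1.47}
\]
where $|\psi_j\rangle$ is an excited state wavefunction and $x^{\beta}_{jn}$ and $x^{\beta\gamma}_{jn}$ are expansion coefficients. We are now in a position to develop solutions to (6.1.35). We deal initially with the temporal derivative term of (6.1.45). Since (6.1.46) and (6.1.47) are time-independent, we need only concern ourselves with temporal derivatives of the fields and exponential terms
\[
\begin{aligned}
i\hbar\frac{\partial}{\partial t}|\psi'_n\rangle = {}&\hbar\omega_n\Big[|\psi_n\rangle + |\psi_n^{a\beta}\rangle(E_\beta)_0 + |\psi_n^{b\beta}\rangle(\bar{E}_\beta)_0 + |\psi_n^{c\beta}\rangle(B_\beta)_0 + |\psi_n^{d\beta}\rangle(\bar{B}_\beta)_0 \\
&\quad + |\psi_n^{e\beta\gamma}\rangle\nabla_\beta(E_\gamma)_0 + |\psi_n^{f\beta\gamma}\rangle\nabla_\beta(\bar{E}_\gamma)_0 + \dots\Big]\exp(-i\omega_n t) \\
&+ \hbar\omega_L\Big[|\psi_n^{a\beta}\rangle(E_\beta)_0 - |\psi_n^{b\beta}\rangle(\bar{E}_\beta)_0 + |\psi_n^{c\beta}\rangle(B_\beta)_0 - |\psi_n^{d\beta}\rangle(\bar{B}_\beta)_0 \\
&\quad + |\psi_n^{e\beta\gamma}\rangle\nabla_\beta(E_\gamma)_0 - |\psi_n^{f\beta\gamma}\rangle\nabla_\beta(\bar{E}_\gamma)_0 + \dots\Big]\exp(-i\omega_n t),
\end{aligned} \tag{6.1.48}
\]
where we have used (6.1.43) to obtain temporal derivatives of the electric and magnetic fields. Expressing the multipole interaction Hamiltonian term, (6.1.37), by use of (6.1.43), we obtain
\[
H_{int} = -\mu_\beta\frac{1}{2}\Big[(E_\beta)_0 + (\bar{E}_\beta)_0\Big] - \Theta_{\beta\gamma}\frac{1}{6}\Big[\nabla_\beta(E_\gamma)_0 + \nabla_\beta(\bar{E}_\gamma)_0\Big] - m_\beta\frac{1}{2}\Big[(B_\beta)_0 + (\bar{B}_\beta)_0\Big] + \dots \tag{6.1.49}
\]
We have suggestively changed the indices relative to (6.1.37) for later convenience. Our final step involves expression of (6.1.35) in terms of these new quantities, yielding the unwieldy expression
\[
\begin{aligned}
&\hbar\omega_n\Big[|\psi_n\rangle + |\psi_n^{a\beta}\rangle(E_\beta)_0 + |\psi_n^{b\beta}\rangle(\bar{E}_\beta)_0 + |\psi_n^{c\beta}\rangle(B_\beta)_0 + |\psi_n^{d\beta}\rangle(\bar{B}_\beta)_0 \\
&\qquad + |\psi_n^{e\beta\gamma}\rangle\nabla_\beta(E_\gamma)_0 + |\psi_n^{f\beta\gamma}\rangle\nabla_\beta(\bar{E}_\gamma)_0 + \dots\Big]\exp(-i\omega_n t) \\
&+ \hbar\omega_L\Big[|\psi_n^{a\beta}\rangle(E_\beta)_0 - |\psi_n^{b\beta}\rangle(\bar{E}_\beta)_0 + |\psi_n^{c\beta}\rangle(B_\beta)_0 - |\psi_n^{d\beta}\rangle(\bar{B}_\beta)_0 \\
&\qquad + |\psi_n^{e\beta\gamma}\rangle\nabla_\beta(E_\gamma)_0 - |\psi_n^{f\beta\gamma}\rangle\nabla_\beta(\bar{E}_\gamma)_0 + \dots\Big]\exp(-i\omega_n t) \\
&- H_0\Big[|\psi_n\rangle + |\psi_n^{a\beta}\rangle(E_\beta)_0 + |\psi_n^{b\beta}\rangle(\bar{E}_\beta)_0 + |\psi_n^{c\beta}\rangle(B_\beta)_0 + |\psi_n^{d\beta}\rangle(\bar{B}_\beta)_0 \\
&\qquad + |\psi_n^{e\beta\gamma}\rangle\nabla_\beta(E_\gamma)_0 + |\psi_n^{f\beta\gamma}\rangle\nabla_\beta(\bar{E}_\gamma)_0 + \dots\Big]\exp(-i\omega_n t) \\
&= \Big[-\mu_\beta\frac{1}{2}\big[(E_\beta)_0 + (\bar{E}_\beta)_0\big] - \Theta_{\beta\gamma}\frac{1}{6}\big[\nabla_\beta(E_\gamma)_0 + \nabla_\beta(\bar{E}_\gamma)_0\big] - m_\beta\frac{1}{2}\big[(B_\beta)_0 + (\bar{B}_\beta)_0\big] + \dots\Big] \\
&\qquad \times \Big[|\psi_n\rangle + |\psi_n^{a\beta}\rangle(E_\beta)_0 + |\psi_n^{b\beta}\rangle(\bar{E}_\beta)_0 + |\psi_n^{c\beta}\rangle(B_\beta)_0 + |\psi_n^{d\beta}\rangle(\bar{B}_\beta)_0 \\
&\qquad\quad + |\psi_n^{e\beta\gamma}\rangle\nabla_\beta(E_\gamma)_0 + |\psi_n^{f\beta\gamma}\rangle\nabla_\beta(\bar{E}_\gamma)_0 + \dots\Big]\exp(-i\omega_n t).
\end{aligned} \tag{6.1.50}
\]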

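Collecting terms involving the same field components, we are able to find analytical expressions for the expansion coefficients of (6.1.45), expressed in (6.1.46) and (6.1.47). So, collecting terms involving $(E_\beta)_0$, we obtain the expression
\[
\hbar\omega_n|\psi_n^{a\beta}\rangle + \hbar\omega_L|\psi_n^{a\beta}\rangle - H_0|\psi_n^{a\beta}\rangle = -\frac{\mu_\beta}{2}|\psi_n\rangle. \tag{6.1.51}
\]
Multiplying through by $\langle\psi_j|$,
\[
\hbar\big[\omega_{nj} + \omega_L\big]\langle\psi_j|\psi_n^{a\beta}\rangle = -\frac{1}{2}\langle\psi_j|\mu_\beta|\psi_n\rangle, \tag{6.1.52}
\]
where $\omega_{jn} = \omega_j - \omega_n = -\omega_{nj}$. From (6.1.46), we see that $\langle\psi_j|\psi_n^{a\beta}\rangle = a^{\beta}_{jn}$, so that
\[
a^{\beta}_{jn} = \frac{\langle\psi_j|\mu_\beta|\psi_n\rangle}{2\hbar(\omega_{jn} - \omega_L)}, \tag{6.1.53}
\]
which allows us to write
\[
|\psi_n^{a\beta}\rangle = \sum_{j \neq n}\frac{\langle\psi_j|\mu_\beta|\psi_n\rangle}{2\hbar(\omega_{jn} - \omega_L)}|\psi_j\rangle. \tag{6.1.54}
\]

In an entirely equivalent way, we are able to group the other terms involving individual field components, i.e. $(\bar{E}_\beta)_0$, $(B_\beta)_0$, $(\bar{B}_\beta)_0$, $\nabla_\beta(E_\gamma)_0$ and $\nabla_\beta(\bar{E}_\gamma)_0$, to yield the other expansion coefficients
\begin{align}
|\psi_n^{b\beta}\rangle &= \sum_{j \neq n}\frac{\langle\psi_j|\mu_\beta|\psi_n\rangle}{2\hbar(\omega_{jn} + \omega_L)}|\psi_j\rangle, \tag{6.1.55} \\
|\psi_n^{c\beta}\rangle &= \sum_{j \neq n}\frac{\langle\psi_j|m_\beta|\psi_n\rangle}{2\hbar(\omega_{jn} - \omega_L)}|\psi_j\rangle, \tag{6.1.56} \\
|\psi_n^{d\beta}\rangle &= \sum_{j \neq n}\frac{\langle\psi_j|m_\beta|\psi_n\rangle}{2\hbar(\omega_{jn} + \omega_L)}|\psi_j\rangle, \tag{6.1.57} \\
|\psi_n^{e\beta\gamma}\rangle &= \sum_{j \neq n}\frac{\langle\psi_j|\Theta_{\beta\gamma}|\psi_n\rangle}{6\hbar(\omega_{jn} - \omega_L)}|\psi_j\rangle, \tag{6.1.58} \\
|\psi_n^{f\beta\gamma}\rangle &= \sum_{j \neq n}\frac{\langle\psi_j|\Theta_{\beta\gamma}|\psi_n\rangle}{6\hbar(\omega_{jn} + \omega_L)}|\psi_j\rangle. \tag{6.1.59}
\end{align}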

6.1.5 The Raman Optical Activity Tensors

Given the form of our perturbed wavefunction, we are now in a position to evaluate expectation values of the multipole moment operators. For example, the dipole moment arising from the perturbation associated with the electromagnetic field in the nth eigenstate can be expressed as

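\[
\begin{aligned}
\langle\psi'_n|\hat{\mu}_\alpha|\psi'_n\rangle - \langle\psi_n|\hat{\mu}_\alpha|\psi_n\rangle = {}&\Big[\langle\psi_n|\hat{\mu}_\alpha|\psi_n^{a\beta}\rangle + \langle\psi_n^{b\beta}|\hat{\mu}_\alpha|\psi_n\rangle\Big](E_\beta)_0 \\
&+ \Big[\langle\psi_n|\hat{\mu}_\alpha|\psi_n^{b\beta}\rangle + \langle\psi_n^{a\beta}|\hat{\mu}_\alpha|\psi_n\rangle\Big](\bar{E}_\beta)_0 + \dots
\end{aligned} \tag{6.1.60}
\]
Here each perturbed bra carries the complex conjugate of the field that multiplies the corresponding ket in (6.1.45).

Dealing first with the terms involving kets of perturbed eigenstates, we can make use of (6.1.54) and (6.1.55), in addition to the identities (6.1.43) and (6.1.44)
\[
\begin{aligned}
&\langle\psi_n|\hat{\mu}_\alpha|\psi_n^{a\beta}\rangle(E_\beta)_0 + \langle\psi_n|\hat{\mu}_\alpha|\psi_n^{b\beta}\rangle(\bar{E}_\beta)_0 \\
&\quad = \sum_{j \neq n}\langle\psi_n|\hat{\mu}_\alpha|\psi_j\rangle\frac{\langle\psi_j|\hat{\mu}_\beta|\psi_n\rangle}{2\hbar(\omega_{jn} - \omega_L)}(E_\beta)_0 + \sum_{j \neq n}\langle\psi_n|\hat{\mu}_\alpha|\psi_j\rangle\frac{\langle\psi_j|\hat{\mu}_\beta|\psi_n\rangle}{2\hbar(\omega_{jn} + \omega_L)}(\bar{E}_\beta)_0 \\
&\quad = \sum_{j \neq n}\frac{\langle\psi_n|\hat{\mu}_\alpha|\psi_j\rangle\langle\psi_j|\hat{\mu}_\beta|\psi_n\rangle}{2\hbar}\left[\frac{(E_\beta)_0}{\omega_{jn} - \omega_L} + \frac{(\bar{E}_\beta)_0}{\omega_{jn} + \omega_L}\right] \\
&\quad = \sum_{j \neq n}\frac{\langle\psi_n|\hat{\mu}_\alpha|\psi_j\rangle\langle\psi_j|\hat{\mu}_\beta|\psi_n\rangle}{2\hbar}\left[\frac{\omega_{jn}\big((E_\beta)_0 + (\bar{E}_\beta)_0\big) + \omega_L\big((E_\beta)_0 - (\bar{E}_\beta)_0\big)}{\omega_{jn}^2 - \omega_L^2}\right] \\
&\quad = \sum_{j \neq n}\frac{\langle\psi_n|\hat{\mu}_\alpha|\psi_j\rangle\langle\psi_j|\hat{\mu}_\beta|\psi_n\rangle}{\hbar}\left[\frac{\omega_{jn}(E_\beta)_0 + i(\dot{E}_\beta)_0}{\omega_{jn}^2 - \omega_L^2}\right].
\end{aligned} \tag{6.1.61}
\]
The last step follows since, by (6.1.44), $\omega_L\big[(E_\beta)_0 - (\bar{E}_\beta)_0\big] = 2i(\dot{E}_\beta)_0$.

Similarly, for the eigenbras
\[
\langle\psi_n^{a\beta}|\hat{\mu}_\alpha|\psi_n\rangle(\bar{E}_\beta)_0 + \langle\psi_n^{b\beta}|\hat{\mu}_\alpha|\psi_n\rangle(E_\beta)_0 = \sum_{j \neq n}\frac{\langle\psi_n|\hat{\mu}_\beta|\psi_j\rangle\langle\psi_j|\hat{\mu}_\alpha|\psi_n\rangle}{\hbar}\left[\frac{\omega_{jn}(E_\beta)_0 - i(\dot{E}_\beta)_0}{\omega_{jn}^2 - \omega_L^2}\right], \tag{6.1.62}
\]
so we can sum the two and obtain a fully analytical expression for the perturbed dipole moment (momentarily including only the contributions from the electric field and its complex conjugate)
\[
\begin{aligned}
&\left[\sum_{j \neq n}\frac{\omega_{jn}\langle\psi_n|\hat{\mu}_\alpha|\psi_j\rangle\langle\psi_j|\hat{\mu}_\beta|\psi_n\rangle}{\hbar(\omega_{jn}^2 - \omega_L^2)} + \sum_{j \neq n}\frac{\omega_{jn}\langle\psi_n|\hat{\mu}_\beta|\psi_j\rangle\langle\psi_j|\hat{\mu}_\alpha|\psi_n\rangle}{\hbar(\omega_{jn}^2 - \omega_L^2)}\right](E_\beta)_0 \\
&+ \left[\sum_{j \neq n}\frac{i\langle\psi_n|\hat{\mu}_\alpha|\psi_j\rangle\langle\psi_j|\hat{\mu}_\beta|\psi_n\rangle}{\hbar(\omega_{jn}^2 - \omega_L^2)} - \sum_{j \neq n}\frac{i\langle\psi_n|\hat{\mu}_\beta|\psi_j\rangle\langle\psi_j|\hat{\mu}_\alpha|\psi_n\rangle}{\hbar(\omega_{jn}^2 - \omega_L^2)}\right](\dot{E}_\beta)_0 + \dots
\end{aligned} \tag{6.1.63}
\]

Invoking the Hermiticity of the multipole moment operators, i.e.
\[
\langle\psi_n|\hat{\mu}_\beta|\psi_j\rangle\langle\psi_j|\hat{\mu}_\alpha|\psi_n\rangle = \overline{\langle\psi_n|\hat{\mu}_\alpha|\psi_j\rangle\langle\psi_j|\hat{\mu}_\beta|\psi_n\rangle},
\]
we can rewrite (6.1.63) as
\[
\begin{aligned}
&\left[\sum_{j \neq n}\frac{\omega_{jn}\langle\psi_n|\hat{\mu}_\alpha|\psi_j\rangle\langle\psi_j|\hat{\mu}_\beta|\psi_n\rangle}{\hbar(\omega_{jn}^2 - \omega_L^2)} + \sum_{j \neq n}\frac{\omega_{jn}\overline{\langle\psi_n|\hat{\mu}_\alpha|\psi_j\rangle\langle\psi_j|\hat{\mu}_\beta|\psi_n\rangle}}{\hbar(\omega_{jn}^2 - \omega_L^2)}\right](E_\beta)_0 \\
&+ \left[\sum_{j \neq n}\frac{i\langle\psi_n|\hat{\mu}_\alpha|\psi_j\rangle\langle\psi_j|\hat{\mu}_\beta|\psi_n\rangle}{\hbar(\omega_{jn}^2 - \omega_L^2)} - \sum_{j \neq n}\frac{i\,\overline{\langle\psi_n|\hat{\mu}_\alpha|\psi_j\rangle\langle\psi_j|\hat{\mu}_\beta|\psi_n\rangle}}{\hbar(\omega_{jn}^2 - \omega_L^2)}\right](\dot{E}_\beta)_0 + \dots
\end{aligned} \tag{6.1.64}
\]
and invoking the definition of the real and imaginary parts of a complex number, $z = x + iy$,
\[
x = \mathrm{Re}\,z = \frac{z + \bar{z}}{2}, \qquad y = \mathrm{Im}\,z = \frac{z - \bar{z}}{2i},
\]
we are left with the expression
\[
\langle\psi'_n|\hat{\mu}_\alpha|\psi'_n\rangle = (\alpha_{\alpha\beta})_{nn}(E_\beta)_0 + \frac{1}{\omega_L}(\alpha'_{\alpha\beta})_{nn}(\dot{E}_\beta)_0, \tag{6.1.65}
\]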

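where the factor $\omega_L^{-1}$ has been introduced to enforce dimensional consistency, and the electric dipole polarisability tensors have been introduced
\begin{align}
(\alpha_{\alpha\beta})_{nn} &= \frac{2}{\hbar}\sum_{j \neq n}\frac{\omega_{jn}}{\omega_{jn}^2 - \omega_L^2}\,\mathrm{Re}\,\langle\psi_n|\hat{\mu}_\alpha|\psi_j\rangle\langle\psi_j|\hat{\mu}_\beta|\psi_n\rangle, \tag{6.1.66} \\
(\alpha'_{\alpha\beta})_{nn} &= -\frac{2}{\hbar}\sum_{j \neq n}\frac{\omega_L}{\omega_{jn}^2 - \omega_L^2}\,\mathrm{Im}\,\langle\psi_n|\hat{\mu}_\alpha|\psi_j\rangle\langle\psi_j|\hat{\mu}_\beta|\psi_n\rangle. \tag{6.1.67}
\end{align}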
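To make the structure of (6.1.66) and (6.1.67) concrete, the following sketch evaluates both sums for a toy set of excited states. It is an illustration under stated assumptions rather than a production implementation: the transition frequencies and transition dipole vectors are supplied by hand, units are arbitrary with `hbar` defaulting to 1, and the function name `sum_over_states` is hypothetical. Hermiticity of the dipole operator is used to obtain $\langle\psi_j|\hat{\mu}_\beta|\psi_n\rangle$ as the complex conjugate of $\langle\psi_n|\hat{\mu}_\beta|\psi_j\rangle$.

```python
def sum_over_states(omega_jn, mu_nj, omega_l, hbar=1.0):
    """Evaluate (6.1.66) and (6.1.67) for a finite list of excited states.

    omega_jn : transition angular frequencies, one per excited state j
    mu_nj    : transition dipoles <n|mu|j>, one 3-component (possibly complex) vector per state
    omega_l  : angular frequency of the incident light
    Returns the 3x3 tensors (alpha, alpha_prime) as nested lists.
    """
    alpha = [[0.0] * 3 for _ in range(3)]
    alpha_prime = [[0.0] * 3 for _ in range(3)]
    for w, mu in zip(omega_jn, mu_nj):
        denominator = w * w - omega_l * omega_l
        for a in range(3):
            for b in range(3):
                # <n|mu_a|j><j|mu_b|n>, with <j|mu_b|n> = conj(<n|mu_b|j>)
                product = complex(mu[a]) * complex(mu[b]).conjugate()
                alpha[a][b] += (2.0 / hbar) * (w / denominator) * product.real
                alpha_prime[a][b] -= (2.0 / hbar) * (omega_l / denominator) * product.imag
    return alpha, alpha_prime
```

With purely real transition dipoles the tensor $\alpha'_{\alpha\beta}$ vanishes, and in general $\alpha_{\alpha\beta}$ comes out symmetric and $\alpha'_{\alpha\beta}$ antisymmetric in its indices, as follows from the real and imaginary parts in the two definitions.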

By use of the terms involving the electric field gradient and magnetic field, it can be shown in a similar way that (6.1.65) becomes

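\[
\begin{aligned}
\langle\psi'_n|\hat{\mu}_\alpha|\psi'_n\rangle = {}&(\alpha_{\alpha\beta})_{nn}(E_\beta)_0 + \frac{1}{\omega_L}(\alpha'_{\alpha\beta})_{nn}(\dot{E}_\beta)_0 + \frac{1}{3}(A_{\alpha,\beta\gamma})_{nn}\nabla_\beta(E_\gamma)_0 \\
&+ \frac{1}{3\omega_L}(A'_{\alpha,\beta\gamma})_{nn}\nabla_\beta(\dot{E}_\gamma)_0 + (G_{\alpha\beta})_{nn}(B_\beta)_0 + \frac{1}{\omega_L}(G'_{\alpha\beta})_{nn}(\dot{B}_\beta)_0 + \dots
\end{aligned} \tag{6.1.68}
\]
where we have introduced a number of higher order optical activity tensors
\begin{align}
(A_{\alpha,\beta\gamma})_{nn} &= \frac{2}{\hbar}\sum_{j \neq n}\frac{\omega_{jn}}{\omega_{jn}^2 - \omega_L^2}\,\mathrm{Re}\,\langle\psi_n|\hat{\mu}_\alpha|\psi_j\rangle\langle\psi_j|\hat{\Theta}_{\beta\gamma}|\psi_n\rangle, \tag{6.1.69} \\
(A'_{\alpha,\beta\gamma})_{nn} &= -\frac{2}{\hbar}\sum_{j \neq n}\frac{\omega_L}{\omega_{jn}^2 - \omega_L^2}\,\mathrm{Im}\,\langle\psi_n|\hat{\mu}_\alpha|\psi_j\rangle\langle\psi_j|\hat{\Theta}_{\beta\gamma}|\psi_n\rangle, \tag{6.1.70} \\
(G_{\alpha\beta})_{nn} &= \frac{2}{\hbar}\sum_{j \neq n}\frac{\omega_{jn}}{\omega_{jn}^2 - \omega_L^2}\,\mathrm{Re}\,\langle\psi_n|\hat{\mu}_\alpha|\psi_j\rangle\langle\psi_j|\hat{m}_\beta|\psi_n\rangle, \tag{6.1.71} \\
(G'_{\alpha\beta})_{nn} &= -\frac{2}{\hbar}\sum_{j \neq n}\frac{\omega_L}{\omega_{jn}^2 - \omega_L^2}\,\mathrm{Im}\,\langle\psi_n|\hat{\mu}_\alpha|\psi_j\rangle\langle\psi_j|\hat{m}_\beta|\psi_n\rangle. \tag{6.1.72}
\end{align}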

The induced electric quadrupole and magnetic dipole moments can be found in a similar fashion, yielding expressions that resemble (6.1.68)

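\begin{align}
\langle\psi'_n|\hat{\Theta}_{\alpha\beta}|\psi'_n\rangle &= (A_{\gamma,\alpha\beta})_{nn}(E_\gamma)_0 - \frac{1}{\omega_L}(A'_{\gamma,\alpha\beta})_{nn}(\dot{E}_\gamma)_0 + \dots, \tag{6.1.73} \\
\langle\psi'_n|\hat{m}_\alpha|\psi'_n\rangle &= (G_{\beta\alpha})_{nn}(E_\beta)_0 - \frac{1}{\omega_L}(G'_{\beta\alpha})_{nn}(\dot{E}_\beta)_0 + \dots, \tag{6.1.74}
\end{align}
where we have truncated the expressions so as to not introduce any more optical activity tensors that appear in higher order terms.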

6.1.6 ROA Intensities

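By certain combinations of the optical activity tensors, we are able to construct quantities that are invariant with respect to the choice of frame of reference. These quantities are denoted the isotropic invariants of the optical activity tensors, and can be combined to yield the ROA intensities[355]. Following the nomenclature used by Polavarapu[356], the numerator and denominator of the CID expression in (6.0.1) are given by
\begin{align}
I^R - I^L &= \frac{48\omega}{c}\left[\frac{1}{\omega_L}\gamma_j^2 + \frac{1}{3\omega_L}\delta_j^2\right], \tag{6.1.75} \\
I^R + I^L &= 2\left[45\bar{\alpha}_j^2 + 7\beta_j^2\right], \tag{6.1.76}
\end{align}
where $c$ is the speed of light. The terms $\bar{\alpha}_j^2$, $\beta_j^2$, $\gamma_j^2$, $\delta_j^2$ are the isotropic invariants of the optical activity tensors, and are defined as various combinations of partial derivatives of the optical activity tensors,
\[
\bar{\alpha}_j^2 = \frac{1}{9}\left(\frac{\partial \alpha_{xx}}{\partial Q_j} + \frac{\partial \alpha_{yy}}{\partial Q_j} + \frac{\partial \alpha_{zz}}{\partial Q_j}\right)^2, \tag{6.1.77}
\]
\[
\begin{aligned}
\beta_j^2 = \frac{1}{2}\Bigg\{&\left(\frac{\partial \alpha_{xx}}{\partial Q_j} - \frac{\partial \alpha_{yy}}{\partial Q_j}\right)^2 + \left(\frac{\partial \alpha_{xx}}{\partial Q_j} - \frac{\partial \alpha_{zz}}{\partial Q_j}\right)^2 + \left(\frac{\partial \alpha_{yy}}{\partial Q_j} - \frac{\partial \alpha_{zz}}{\partial Q_j}\right)^2 \\
&+ 6\left[\left(\frac{\partial \alpha_{xy}}{\partial Q_j}\right)^2 + \left(\frac{\partial \alpha_{yz}}{\partial Q_j}\right)^2 + \left(\frac{\partial \alpha_{xz}}{\partial Q_j}\right)^2\right]\Bigg\},
\end{aligned} \tag{6.1.78}
\]
\[
\begin{aligned}
\gamma_j^2 = \frac{1}{2}\Bigg\{&\left(\frac{\partial \alpha_{xx}}{\partial Q_j} - \frac{\partial \alpha_{yy}}{\partial Q_j}\right)\left(\frac{\partial G'_{xx}}{\partial Q_j} - \frac{\partial G'_{yy}}{\partial Q_j}\right) + \left(\frac{\partial \alpha_{xx}}{\partial Q_j} - \frac{\partial \alpha_{zz}}{\partial Q_j}\right)\left(\frac{\partial G'_{xx}}{\partial Q_j} - \frac{\partial G'_{zz}}{\partial Q_j}\right) \\
&+ \left(\frac{\partial \alpha_{yy}}{\partial Q_j} - \frac{\partial \alpha_{zz}}{\partial Q_j}\right)\left(\frac{\partial G'_{yy}}{\partial Q_j} - \frac{\partial G'_{zz}}{\partial Q_j}\right) + 3\Bigg[\frac{\partial \alpha_{xy}}{\partial Q_j}\left(\frac{\partial G'_{xy}}{\partial Q_j} + \frac{\partial G'_{yx}}{\partial Q_j}\right) \\
&+ \frac{\partial \alpha_{xz}}{\partial Q_j}\left(\frac{\partial G'_{xz}}{\partial Q_j} + \frac{\partial G'_{zx}}{\partial Q_j}\right) + \frac{\partial \alpha_{yz}}{\partial Q_j}\left(\frac{\partial G'_{yz}}{\partial Q_j} + \frac{\partial G'_{zy}}{\partial Q_j}\right)\Bigg]\Bigg\},
\end{aligned} \tag{6.1.79}
\]
\[
\begin{aligned}
\delta_j^2 = \frac{\omega_L}{2}\Bigg[&\left(\frac{\partial \alpha_{yy}}{\partial Q_j} - \frac{\partial \alpha_{xx}}{\partial Q_j}\right)\frac{\partial A_{zxy}}{\partial Q_j} + \left(\frac{\partial \alpha_{xx}}{\partial Q_j} - \frac{\partial \alpha_{zz}}{\partial Q_j}\right)\frac{\partial A_{yzx}}{\partial Q_j} + \left(\frac{\partial \alpha_{zz}}{\partial Q_j} - \frac{\partial \alpha_{yy}}{\partial Q_j}\right)\frac{\partial A_{xyz}}{\partial Q_j} \\
&+ \frac{\partial \alpha_{xy}}{\partial Q_j}\left(\frac{\partial A_{yyz}}{\partial Q_j} - \frac{\partial A_{zyy}}{\partial Q_j} + \frac{\partial A_{zxx}}{\partial Q_j} - \frac{\partial A_{xxz}}{\partial Q_j}\right) \\
&+ \frac{\partial \alpha_{xz}}{\partial Q_j}\left(\frac{\partial A_{yzz}}{\partial Q_j} - \frac{\partial A_{zzy}}{\partial Q_j} + \frac{\partial A_{xxy}}{\partial Q_j} - \frac{\partial A_{yxx}}{\partial Q_j}\right) \\
&+ \frac{\partial \alpha_{yz}}{\partial Q_j}\left(\frac{\partial A_{zzx}}{\partial Q_j} - \frac{\partial A_{xzz}}{\partial Q_j} + \frac{\partial A_{xyy}}{\partial Q_j} - \frac{\partial A_{yyx}}{\partial Q_j}\right)\Bigg],
\end{aligned} \tag{6.1.80}
\]
where derivatives are performed with respect to the normal coordinates of the system. We qualitatively discuss the computation of these derivatives in Section 6.2.1.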

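The two Raman invariants, $\bar{\alpha}_j^2$ of (6.1.77) and $\beta_j^2$ of (6.1.78), can be sketched in a few lines, together with a check that they do not change when the frame of reference is rotated. In the sketch below the input is assumed to be a symmetric 3×3 tensor of polarisability derivatives for a single normal mode, with $\bar{\alpha}_j^2$ taken as one ninth of the squared trace; the function names and the rotation helper are illustrative and not taken from the text.

```python
import math


def alpha_bar_squared(d_alpha):
    """Isotropic invariant: one ninth of the squared trace of the derivative tensor."""
    trace = d_alpha[0][0] + d_alpha[1][1] + d_alpha[2][2]
    return (trace / 3.0) ** 2


def beta_squared(d_alpha):
    """Anisotropic invariant of a symmetric derivative tensor, following (6.1.78)."""
    xx, yy, zz = d_alpha[0][0], d_alpha[1][1], d_alpha[2][2]
    xy, yz, xz = d_alpha[0][1], d_alpha[1][2], d_alpha[0][2]
    diagonal = (xx - yy) ** 2 + (xx - zz) ** 2 + (yy - zz) ** 2
    off_diagonal = 6.0 * (xy ** 2 + yz ** 2 + xz ** 2)
    return 0.5 * (diagonal + off_diagonal)


def rotation_about_z(angle):
    """Rotation matrix for a rotation by `angle` radians about the z axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def rotate_tensor(tensor, rotation):
    """Transform a rank-2 tensor to a rotated frame: T'_ab = R_ac R_bd T_cd."""
    return [
        [
            sum(rotation[a][c] * rotation[b][d] * tensor[c][d]
                for c in range(3) for d in range(3))
            for b in range(3)
        ]
        for a in range(3)
    ]
```

Applying `rotate_tensor` with any rotation matrix and recomputing the two invariants should reproduce the original values to within rounding, which is the sense in which these combinations are independent of the frame of reference.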
6.2 Previous Computational Work

6.2.1 Algorithmic Developments

The computation of ab initio Raman spectra has been possible for a number of years. The electric dipole-electric dipole polarisability tensor, $\alpha_{\alpha\beta}$, for a molecular system can be expressed quantum mechanically by truncating the infinite sum-over-states of (6.1.66) to finite order[354]. Rewriting (6.1.66) for convenience,
\[
(\alpha_{\alpha\beta})_{nn} = \frac{2}{\hbar}\sum_{j \neq n}\frac{\omega_{jn}}{\omega_{jn}^2 - \omega_L^2}\,\mathrm{Re}\,\langle\psi_n|\hat{\mu}_\alpha|\psi_j\rangle\langle\psi_j|\hat{\mu}_\beta|\psi_n\rangle.
\]
We assume that the incident excitation frequency does not induce resonance within the system, which can be enforced by taking $\omega_L \ll \omega_{jn}$. Taking this inequality as opposed to $\omega_L \gg \omega_{jn}$ ensures that we do not promote the system to a higher excitation level. This nonresonant approximation allows us to characterise $(\alpha_{\alpha\beta})_{nn}$ as independent of the incident excitation, i.e. the “static limit”, where $\omega_{jn}^2 - \omega_L^2 \approx \omega_{jn}^2$, and obtain
\[
(\alpha_{\alpha\beta})_{nn} = \sum_{j \neq n}\frac{2}{E_j - E_n}\,\mathrm{Re}\,\langle\psi_n|\hat{\mu}_\alpha|\psi_j\rangle\langle\psi_j|\hat{\mu}_\beta|\psi_n\rangle, \tag{6.2.1}
\]
where $E_j$ is the energy of eigenstate $|\psi_j\rangle$. Since the vast majority of biochemical systems (excluding transition metal complexes, which we do not discuss here) do not exhibit strong correlation, the gap between the ground and higher energy states is typically very large[357], which then allows us to truncate the expression (6.2.1) to low order, since excited states do not contribute significantly to the molecular polarisability[358].

Derivatives of $(\alpha_{\alpha\beta})_{nn}$ with respect to the internal degrees of freedom of the molecular system have historically been evaluated numerically by computing $(\alpha_{\alpha\beta})_{nn}$ at a number of displaced molecular geometries[359]. Taking the geometric displacement to be small, one can invoke a finite difference method to obtain numerical derivatives.
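The finite difference procedure described above can be sketched as follows. This is a minimal illustration rather than the scheme used in any particular package: `central_difference_tensor` takes a user-supplied callable returning the 3×3 polarisability at a given set of normal coordinate displacements, and `toy_polarisability` is a hypothetical analytic model standing in for an ab initio calculation. A two-point central difference is assumed, and the step size is an arbitrary illustrative choice.

```python
def central_difference_tensor(polarisability, q0, mode, step=1e-3):
    """Two-point central-difference derivative of a 3x3 tensor along one normal coordinate.

    polarisability : callable mapping a list of normal coordinate values to a 3x3 nested list
    q0             : reference geometry expressed as a list of normal coordinate values
    mode           : index of the normal coordinate to displace
    """
    q_plus, q_minus = list(q0), list(q0)
    q_plus[mode] += step
    q_minus[mode] -= step
    a_plus = polarisability(q_plus)
    a_minus = polarisability(q_minus)
    return [
        [(a_plus[i][j] - a_minus[i][j]) / (2.0 * step) for j in range(3)]
        for i in range(3)
    ]


def toy_polarisability(q):
    """Hypothetical model: a diagonal polarisability that depends only on the first mode."""
    stretch = q[0]
    a_perp = 0.5 + 0.1 * stretch
    a_par = 1.0 + 0.3 * stretch + 0.05 * stretch ** 2
    return [[a_perp, 0.0, 0.0], [0.0, a_perp, 0.0], [0.0, 0.0, a_par]]
```

Each derivative requires two polarisability evaluations per normal mode, which is the source of the cost of the numerical approach relative to analytical derivatives.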

As we have seen in Section 3, the normal coordinates form a convenient internal basis with which to define a molecular system. Working with the harmonic approximation for the molecular potential energy surface, the normal coordinates are eigenvectors of the mass-weighted Hessian. It is possible to work with higher accuracy methods that use anharmonic corrections to the molecular energy. However, anharmonic corrections are computationally cumbersome, and have only been utilised in small molecular systems, such as methyloxirane[360]. Indeed, calculation of the Hessian alone is a computationally intensive procedure. A number of techniques are available for avoiding explicit computation of the full Hessian, such as the “intensity-tracking” technique of Reiher and coworkers[361, 362], in which only those modes with the highest intensity are evaluated, but these are not widely implemented.

More recent developments have allowed for the incorporation of electric field perturbations into the molecular Hamiltonian[363]. Since $(\alpha_{\alpha\beta})_{nn}$ results from the second derivative of the molecular energy with respect to an applied electric field, third order derivatives of the Hamiltonian with respect to the electric field yield the spatial derivatives of $(\alpha_{\alpha\beta})_{nn}$. Analytical expressions for this Hamiltonian are rarely available, and so derivatives with respect to internal degrees of freedom can be found as above, by displacing the molecular geometry and evaluating $(\alpha_{\alpha\beta})_{nn}$ at these displaced geometries, followed by use of a finite differences method. Modern techniques, however, rely on analytical expressions for the spatial derivatives of $(\alpha_{\alpha\beta})_{nn}$ with respect to the internal molecular degrees of freedom[364].

It is only more recent developments that have allowed for the calculation of ab initio ROA spectra. From (6.1.79) and (6.1.80), we see that in addition to the derivatives of $(\alpha_{\alpha\beta})_{nn}$, we also require derivatives of the electric dipole-magnetic dipole and electric dipole-electric quadrupole polarisability tensors, $G'_{\alpha\beta}$ and $A_{\alpha,\beta\gamma}$, respectively. Whilst derivatives of the latter involve relatively straightforward extensions to the derivatives of $(\alpha_{\alpha\beta})_{nn}$, it is derivatives of $G'_{\alpha\beta}$ that have historically required numerical treatment.

The seminal contribution to this field is that of Polavarapu[356]. By using an expression for $G'_{\alpha\beta}$ in the static limit, i.e. the frequency-independent form of the tensor[365], $G'_{\alpha\beta}$ can be evaluated at a number of displaced geometries, and numerical derivatives taken at the Hartree-Fock level of theory. Later work extended the approach of Polavarapu to the DFT level of theory[366] and to frequency-dependent polarisabilities[367]. It is only the relatively recent developments of Liégeois et al.[368] that have permitted a fully analytical approach to the computation of ROA spectra, where an analytical derivative procedure was introduced at the Hartree-Fock level of theory. DFT methods are also supported in modern quantum chemical software packages[369].

6.2.2 Solvation Effects

The fact that biochemical systems are stabilised by solvent interactions makes their modelling problematic. Since the solvent plays an important role in dictating the energetics of a molecular system, it becomes imperative that spectroscopic calculations include some level of solvent modelling. It has been shown that omitting the solvent from spectroscopic calculations yields poor quality results that deviate severely from experiment[370]. Indeed, when dealing with charged species, such as zwitterionic amino acids, the molecules are not energetically stable if the solvent is not included[371], and optimise to the neutral amine-carboxylic acid species in the absence of constraints.

The two major models for solvation effects in ab initio calculations are implicit and explicit solvation. With the former[372], the solvent environment is modelled as a homogeneous dielectric medium, thus increasing the magnitude of electrostatic interactions relative to the system in vacuo. The Polarisable Continuum Model[33] (PCM) and Conductor-like Screening Model[34] (COSMO) have been effective in predicting molecular energies[373], but lack any atomic-level properties of the solvent, such as the presence of solvent hydrogen bond donors/acceptors. Implicit solvation methods have been found to offer minimal benefits in computing vibrational frequencies, and therefore vibrational spectra, relative to gas phase calculations[374].

A number of explicit solvation schemes have been investigated. At the lowest end, one can include a small number of solvent molecules and treat the entire solute-solvent system quantum mechanically, perhaps embedded in an implicit solvation field. The number of explicit solvent molecules treated in this way is somewhat arbitrary, and their positioning around the solute can be problematic. Solvent molecules can be added in an ad hoc fashion around polar (or non-polar if the solvent is non-polar) groups, leading to significant improvement over gas phase spectra[375]. However, a number of problems can also be attributed to this methodology, such as poor geometrical optimisation of the system as a result of its distance from a minimum on the potential energy surface[376].

Another, more successful, method is the use of QM/MM, where the solvent is treated classically and the solute quantum mechanically. This has proven to be a very effective methodology for calculating spectra. For example, Cheeseman et al.[299] have predicted the methyl-β-D-glucose ROA spectrum to an impressive level of accuracy, reproducing the vast majority of spectral features. More recently, this approach has been further validated on a number of other systems, such as glucuronic acid[377] and β-D-xylose[378]. However, one is still constrained to selecting a conservative number of solvent molecules to include in the calculation for it to remain tractable; simply including hundreds of solvent molecules in the QM/MM calculation is not routinely feasible.

A complexity in the QM/MM method of modelling solvent arises in the geometrical optimisation of the system. For the normal modes of motion of the system to be correctly determined, the system is required to relax to an energetic minimum on the potential energy surface. Geometry optimisation is notoriously difficult for a system including even a conservative number of explicit solvent molecules, as modelled in [377, 378]; the presence of water molecules in these studies flattens the potential energy surface, rendering the location of well-defined minima difficult. Two methods were proposed: OptAll and OptSolute. The OptAll scheme optimises the entire QM/MM system, while the OptSolute method freezes all solvent molecules and optimises only the solute in the field of the static solvent. The latter method is naturally quicker than the former, but was found to yield poorer quality spectra. Some trade-off between accuracy and computational tractability is to be expected, but without there being an accepted methodology for quantitatively comparing predicted spectra to experimental spectra, it is difficult to ascertain whether the improvement in spectral quality warrants the additional computational effort.

6.2.3 Quantitative Measures of Similarity

A note is in order on the absence of a quantitative measure of spectral quality in the literature. Typically, the comparison of experimental and computed spectra is reduced to a “dark art”, where experts cast their eye over two spectra and describe the prediction quality with vague statements, such as “good agreement”. Spectra are composed of highly complex patterns, and so validation in this crude manner is particularly unsatisfactory. Ideally, one would like to reduce the determination of the quality of prediction to some unambiguous numerical procedure.

The difference in absolute amplitudes between computed and experimental spectra is irrelevant since the amplitude of an experimental spectrum is a function of collection time. As such, one requires that the absolute amplitudes of the experimental and computed spectra be normalised for a direct comparison to be made. Relative amplitudes are, however, informative.

An intuitive approach to quantitatively comparing experimental and computed spectra is to repurpose the Carbó index[379, 380]. The Carbó index was initially proposed in the context of ascertaining the similarity in electron density between two molecules, and subsequently correlating this similarity by a structure-activity relationship. Whilst heavily cited, the notion of the Carbó index is perhaps unsatisfying as it requires the interpretation of tiny differences (O(10^-3)) in molecular similarity to be related to gross differences in molecular activity. The Carbó index, $S_{fg}$, has functional form

\[
S_{fg} = \frac{\int f(x)\, g(x)\, \mathrm{d}x}{\sqrt{\int f(x)^2\, \mathrm{d}x \int g(x)^2\, \mathrm{d}x}}
\left( \approx \frac{\sum_{i=1}^{N} f(\nu_i)\, g(\nu_i)}{\sqrt{\sum_{i=1}^{N} f(\nu_i)^2 \sum_{j=1}^{N} g(\nu_j)^2}} \right) ,
\tag{6.2.2}
\]

where $f(x)$ and $g(x)$ are arbitrary functions. One can immediately interpret $S_{fg}$ based on the numerator, which is the inner product (or “overlap” in quantum mechanical parlance) of two functions, appropriately normalised by the denominator so that $0 \leq S_{fg} \leq 1$, with $S_{fg} = 0$ denoting no similarity and $S_{fg} = 1$ denoting equivalence between the functions. Neither experimental nor computed spectra have analytical form, so the integrals in the above equation require discretisation. The approximation in parentheses in the above equation denotes the discrete form of the Carbó index, which has been repurposed to denote the similarity between two spectra, $f(\nu)$ (experimental) and $g(\nu)$ (computed), the independent variable being the wavenumber[381, 382].

However, the Carbó index is not immediately applicable in this form. For example, if experimental and computed spectra had exactly the same form, but one was shifted along the abscissa relative to the other by some small amount, the Carbó index could classify the spectra as dissimilar. In reality, the two spectra are exactly equivalent but for a shift that could have resulted from a particular experimental setup.
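The discrete form of the Carbó index, and its sensitivity to a shift along the abscissa, can be illustrated with a few lines of Python. This is a minimal sketch rather than code used in this work: the wavenumber grid, the Gaussian test band and the names `carbo_index` and `band` are all hypothetical, and both spectra are assumed to be sampled on the same grid.

```python
import numpy as np

def carbo_index(f, g):
    """Discrete Carbo index of two spectra sampled on a common wavenumber grid."""
    f = np.asarray(f, dtype=float)
    g = np.asarray(g, dtype=float)
    return float(np.dot(f, g) / np.sqrt(np.dot(f, f) * np.dot(g, g)))

# Hypothetical test data: a single Gaussian band on an illustrative grid.
nu = np.linspace(200.0, 1800.0, 1601)  # wavenumbers in cm^-1, 1 cm^-1 spacing

def band(centre, width=5.0):
    """Gaussian band of the given centre and standard deviation (cm^-1)."""
    return np.exp(-0.5 * ((nu - centre) / width) ** 2)

# A rescaled copy of a spectrum: absolute amplitudes do not affect the index.
same = carbo_index(band(1000.0), 3.0 * band(1000.0))

# An identical band displaced by 30 cm^-1: the overlap all but vanishes.
shifted = carbo_index(band(1000.0), band(1030.0))
```

With these illustrative inputs the rescaled copy returns an index of one to within rounding, whereas the shifted copy returns an index close to zero despite the two bands having identical shape.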

To remedy this failing, one can introduce a number of free parameters into the evaluation of f(ν) and g(ν) at the specified points. The maximum of the Carbó index with respect to these free parameters is then taken as the similarity between the two spectra. One free parameter that immediately springs to mind is that introduced by Radom[383], who noticed that computed spectra are overestimated along the abscissa owing to, inter alia, the harmonic approximation, and so the computed spectrum should be scaled, i.e. g(αν), with 0.95 ≤ α ≤ 1.0.

Another free parameter could correspond to a shifting, so that the computed spectrum is g(αν + β), with β some global spectrum shift. One could select α, β to be anisotropic, so that they scale and shift different regions of the computed spectrum by different amounts. The possibilities are virtually endless, and at some point the optimisation of free parameters simply leads to the computed spectrum becoming infinitely plastic. In this case, the similarity between computed and experimental spectra becomes meaningless.
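A global scale and shift could, for instance, be optimised by a simple grid search. The sketch below is illustrative only and makes one interpretive assumption that should be stated loudly: the transformation is applied to the computed abscissa, so that a computed feature at wavenumber ν is moved to αν + β before the spectrum is interpolated back onto the experimental grid. The function names and the choice of linear interpolation are hypothetical.

```python
import numpy as np

def carbo_index(f, g):
    """Discrete Carbo index of two spectra sampled on a common grid."""
    return float(np.dot(f, g) / np.sqrt(np.dot(f, f) * np.dot(g, g)))

def best_scale_and_shift(nu, f_exp, g_calc, alphas, betas):
    """Grid search for the (alpha, beta) pair that maximises the Carbo index.

    Assumption: a computed feature at nu is moved to alpha * nu + beta.
    Returns (best index, alpha, beta).
    """
    best = (-np.inf, None, None)
    for alpha in alphas:
        for beta in betas:
            # Resample the transformed computed spectrum on the experimental
            # grid; points outside the transformed range are set to zero.
            g_t = np.interp(nu, alpha * nu + beta, g_calc, left=0.0, right=0.0)
            if not np.any(g_t):
                continue  # nothing left to compare after the transformation
            s = carbo_index(f_exp, g_t)
            if s > best[0]:
                best = (s, float(alpha), float(beta))
    return best
```

Restricting `alphas` and `betas` to narrow, physically motivated ranges is one way of limiting the plasticity discussed above; the wider the search, the less the maximised index says about the quality of the prediction.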

In our work, we have regrettably avoided use of a similarity measure between computed and experimental spectra since it is not immediately clear how to introduce free parameters without rendering the similarity metric questionable. However, this is certainly an avenue that can be explored in the future, and would be of undoubted use to the community.

6.2.4 Large Biomolecular Systems

In the previous section, we have restricted our discussion to small molecular systems. However, biomolecular systems are frequently composed of large polymers, such as polypeptides, which frequently span more than 100 amino acid residues. It is hugely challenging to model the ROA spectrum of such a large system given the computational complexities associated with calculating ROA spectra. However, a number of groups have approached this problem with a great degree of success.

Early ROA studies of polymeric systems have included helical decaalanine[289] (10 amino acids), valinomycin[384] (12 amino acids) and the β-domain of metallothionein[385] (31 amino acids), where the latter was modelled in vacuo. The results obtained were impressive, given that these studies were all performed before analytical derivatives of G′αβ were available.

A particularly impressive tool is the Cartesian Coordinate Tensor Transfer (CCT) methodology[386], which resembles the notion of the Kabsch RMSD that we employ in Section 6.3.3. The basic idea is the fragmentation of a large system into a number of small subsystems. The unitary transformation from the subsystem to the large system is found by minimising the associated RMSD between the two systems. The unitary transformation matrix can then be used to transform quantities computed in the subsystem, such as the Hessian and optical activity tensors, into the large system, exploiting the apparent locality of these quantities. CCT has been used in the calculation of vibrational circular dichroism spectra of oligonucleotides[387] and IR spectra of polypeptides[388], with results that are almost indistinguishable from the fully ab initio spectrum. Similarly impressive results have been obtained for insulin[389] (51 amino acids).

In spite of the excellent results obtained through CCT, there have been a number of studies which refute the validity of CCT for ROA spectra[390, 391]. In short, the authors claim that the ROA G′αβ and Aα,βγ tensors are not properly transformed by CCT. The criticisms have been rebutted in a later work[392], where it has been claimed that the fragmentation scheme needs to be carefully selected when dealing with systems involving intramolecular hydrogen bonds, and that CCT yields very good results given an appropriate fragmentation. Indeed, a recent study using CCT has provided excellent results for the ROA spectra of a number of large globular proteins[393], such as hen egg-white lysozyme, comprising well over 100 amino acids.

6.3 Technical Details

6.3.1 Zwitterionic Histidine

Histidine is unusual among the amino acids in the number of biochemical roles it is able to undertake. Traditionally characterised as a polar amino acid, histidine does not substitute particularly well with other amino acids[394]. The nitrogen atoms of the imidazole group have long been known to act as proton shuttles, and grant functionality for a number of enzymatic mechanisms. However, histidine has also been shown to participate in interactions which are typically perceived to involve large aromatic amino acids such as tryptophan and phenylalanine. Noncovalent interactions involving histidine have been shown to exist in the form of π-π stacking between its imidazole ring and a number of DNA nucleobases[395].

Histidine features in the so-called “catalytic triad” of a number of biochemically important enzymes, such as serine and cysteine proteases, where it acts as a proton shuttle during the process of protein catabolism[396]. The Manzetti mechanism[397] suggests that matrix metalloproteinases, a set of well-characterised cancer therapeutic targets, attain catalysis through a pair of histidine residues which coordinate a Zn2+ ion. The penta-coordinate Zn2+ ion is induced to act as a reversible electron donor, and can hydrolyse the scissile peptide bond. Indeed, research within the field of biocatalysis has found evidence for histidine residues within tripeptide motifs coordinating even more exotic metal ions, such as Pt2+ and Au3+, and exhibiting chemotherapeutic properties[398].

Figure 6.1: Zwitterionic histidine (His0[NτH]) where the imidazole nitrogen atoms have been labelled as indicated in the text. The two sidechain dihedral angles, χ1, χ2 are also labelled.

Five protonation states of histidine exist: His2+, His1+, His0, His1- and His2-, where the superscript denotes the formal charge state of the species. In addition to these protonation states, the imidazole ring in His0 and His1- can exist in one of two tautomeric forms. This originates from the lone pairs of the nitrogen atoms contributing to π-delocalisation around the five-membered imidazole ring. The imidazole nitrogen atoms are labelled Nπ or Nτ , denoting the nitrogen one and two bonds away, respectively, from the alkyl sidechain, as shown in Figure 6.1. For convenience, we also depict the major torsional degrees of freedom that are typically varied for conformational studies of histidine; χ1 and χ2 are the side chain dihedrals.

Tautomeric forms are denoted by specifying the protonated nitrogen, i.e. NπH or NτH. Using this nomenclature, we can uniquely label the seven protonation/tautomeric states of histidine: His2+, His1+, His0[NπH], His0[NτH], His1-[NπH], His1-[NτH] and His2-. These states are represented pictorially in Figure 6.2.

Figure 6.2: Zwitterionic histidine protonation states and tautomers.

In the following, we compute the Raman and ROA spectra for the His0 and His1+ charge states, as these are by far the most biologically prevalent forms. In forming the His0 spectra, we include contributions from both tautomers of histidine. A further point worth consideration is that the system transitions between charge states in response to pH. However, pH is a macroscopic quantity, and so cannot be properly accounted for in single molecule studies. We introduce the fractional distribution of charge states in a solution at a given pH, i.e. the proportion of the species in a given charge state at a given pH. We present the fractional distributions of the histidine charge states as a function of pH in Figure 6.3.

At physiological pH, we see that roughly 94% of histidine is in the His0 form, with small amounts of His1- and His1+. If we are to compare our single charge state spectra to experimental spectra, we must ensure that the experimental conditions coincide with the maxima of the relative fractional distribution curves. Alternatively, if we were attempting to compute the spectra from previously acquired experimental results, we would have to consider the effects of additional charged species, and include their contributions into the final spectrum.

Figure 6.3: Fractional contribution of the various histidine charge states as a function of pH. Blue, green, red and yellow curves correspond to the fractional contributions of the His2+, His1+, His0 and His1- states, respectively. We have limited the pH range between 0 and 12, and so omit the fractional contribution of the His2- state. (Axes: pH, 0 to 12; normalised fractional distribution, 0 to 1.)
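Fractional distribution curves of this kind follow from the acid dissociation constants of the successive deprotonation steps. The sketch below shows one way they might be evaluated; it is not the procedure used to generate Figure 6.3. The pKa values are assumptions (rounded textbook-style figures for the carboxyl, imidazolium and ammonium groups, plus an illustrative value for the final imidazole deprotonation), and the function name is hypothetical.

```python
import numpy as np

# Assumed macroscopic pKa values, NOT taken from this work: carboxyl,
# imidazolium, ammonium and imidazole N-H deprotonation, respectively.
PKA = (1.8, 6.0, 9.2, 14.0)
STATES = ("His2+", "His1+", "His0", "His1-", "His2-")

def charge_state_fractions(pH, pka=PKA):
    """Fraction of each charge state of a polyprotic acid at a given pH."""
    # Population of each state relative to the fully protonated species:
    # log10(r_0) = 0 and log10(r_n) = log10(r_{n-1}) + (pH - pKa_n).
    log_r = np.concatenate(([0.0], np.cumsum(pH - np.asarray(pka, dtype=float))))
    # Subtracting the maximum avoids overflow and leaves the ratios unchanged.
    r = 10.0 ** (log_r - log_r.max())
    return dict(zip(STATES, r / r.sum()))

fractions_physiological = charge_state_fractions(7.4)
fractions_acidic = charge_state_fractions(4.2)
```

With the assumed pKa values, His0 dominates near neutral pH (in the region of 94 to 95% at pH 7.4), while His1+ dominates at pH 4.2, consistent with the qualitative picture described above.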

We have prepared samples of histidine at a pH of 7.8 and 4.2, both at a concentration of 0.26 M. Spectra were collected over a period of 7 hours, using an incident wavelength of 532 nm, a laser power of 700 mW and a 1.47 second exposure time. All spectra have been normalised so that they can be directly compared with the computed spectra. The absolute amplitudes are therefore irrelevant in our analysis.

6.3.2 Computing Spectra from a Number of Conformers

An optical spectrum of a molecular system is a function of the internal degrees of freedom of a system. Two vastly differing conformers of a system possess distinct spectra; while the major features are conserved between the two, e.g. peptide bond signal, a number of the finer details are subject to change. Therefore, when we compute an optical spectrum, we require some means for including the contribution to the spectrum of all conformers that make up the conformational space of the molecular system.

Since we are limited by computational resources, it is simply not feasible to exhaustively compute spectra for every possible conformer of a system and combine them to yield the total spectrum. Instead, we are reduced to computing the spectra of a small number of conformers, and using these to form the total spectrum. Each conformer spectrum has an associated weight,

\[
w_k = \frac{\exp[-\beta \Delta G(\mathbf{x}_k)]}{\sum_{j=1}^{N} \exp[-\beta \Delta G(\mathbf{x}_j)]} ,
\]

where $\beta$ is the thermodynamic beta ($\beta = 1/k_B T$, where $k_B$ is the Boltzmann constant and $T$ is the temperature), $\Delta G(\mathbf{x}_k)$ is the free energy of the $k$th conformer, and the weights are normalised over all $N$ conformers. Then, given the spectrum associated with the $k$th conformer, $f_k(\nu)$, the total spectrum is given by

\[
f(\nu) = \sum_{k=1}^{N} w_k f_k(\nu) .
\]

Introducing some nomenclature, each conformer spectrum is Boltzmann weighted. The Boltzmann weight of a spectrum is linked to the proportion of time spent in the given conformer. By analysis, we see that the lower the free energy of a conformer, the higher its Boltzmann weight, implying that the amount of time the system spends in a given conformational state is a function of its energy.
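The Boltzmann weighting of conformer spectra can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in this work: free energies are assumed to be supplied in kJ/mol, all conformer spectra are assumed to share a common wavenumber grid, and the function names are hypothetical.

```python
import numpy as np

# Boltzmann constant per mole (the gas constant) in kJ mol^-1 K^-1; the choice
# of energy unit is an assumption made for this illustration.
K_B = 0.0083144626

def boltzmann_weights(free_energies, temperature=300.0):
    """Normalised Boltzmann weights w_k for a set of conformer free energies."""
    g = np.asarray(free_energies, dtype=float)
    beta = 1.0 / (K_B * temperature)
    # Shifting by the minimum leaves the normalised weights unchanged but
    # prevents overflow in the exponential.
    w = np.exp(-beta * (g - g.min()))
    return w / w.sum()

def total_spectrum(weights, conformer_spectra):
    """f(nu) = sum_k w_k f_k(nu); conformer_spectra has shape (N, n_points)."""
    return np.asarray(weights, dtype=float) @ np.asarray(conformer_spectra, dtype=float)
```

Only free energy differences enter the normalised weights, so any common reference energy may be subtracted before the weights are evaluated.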

From the definition of the total spectrum, it is apparent that if we are to select only a small number of conformers with which to compute a spectrum, we would like these conformers to be as low in energy as possible to maximise their Boltzmann weights. Therefore, optimisation of the conformers is required. However, we also require as diverse a conformer set as possible. The collection time for an experimental spectrum is of the order of hours, which means the system is able to sample all regions of conformational space, i.e. barriers between energetic minima do not constrain the system to a single conformational state.

The issue of selecting as diverse a conformational sample as possible is tackled in Section 6.3.3, while the issue of optimising the system is dealt with in Section 6.5.

6.3.3 Filtering of Similar Geometries

For the following discussion, the $3 \times N$ matrix $\mathbf{X} = \begin{pmatrix} \mathbf{x}_1 & \dots & \mathbf{x}_N \end{pmatrix}$, $N$ being the number of atoms in a molecule, is used to denote a conformation, where $\mathbf{x}_i$ is the Cartesian position vector for the $i$th atom.

Perhaps the most ubiquitous form of geometrical comparison between two structures, $\mathbf{X}$ and $\mathbf{Y}$, is the root mean square deviation (RMSD), which we denote $R(\mathbf{X}, \mathbf{Y})$,

\[
R(\mathbf{X}, \mathbf{Y}) = \sqrt{\frac{1}{N} \sum_{i=1}^{N} |\mathbf{x}_i - \mathbf{y}_i|^2} .
\tag{6.3.1}
\]

However, in the form of (6.3.1), a number of pitfalls can be encountered when dealing with molecular conformations. The most significant issue is the fact that $R(\mathbf{X}, \mathbf{Y})$ is not invariant with respect to rigid transformations. For example, if $\mathbf{X}$ and $\mathbf{Y}$ possess the same internal geometry, but are rigidly translated relative to one another, then (6.3.1) represents the two geometries as being dissimilar. The same result holds for the case of rigid rotations relative to one another. Molecules (in homogeneous fields) are entirely equivalent after having undergone rigid translations, and so the RMSD in its current incarnation is of no value for our purposes.

We can render (6.3.1) in a useful form by finding $R(\mathbf{X}, \mathbf{Y})$ such that it is minimised with respect to all rigid rotations and translations, i.e. by minimising the function

\[
E(\mathbf{X}, \mathbf{Y}) = \frac{1}{N} \sum_{i=1}^{N} |\mathbf{U}\mathbf{x}_i - \mathbf{y}_i|^2 ,
\tag{6.3.2}
\]

where $\mathbf{U}$ is an orthonormal matrix, i.e. $\mathbf{U}^\top \mathbf{U} = \mathbf{I}$, the identity matrix, and can therefore be interpreted as a matrix defining a rigid transformation [399]. Taking the square root of $E(\mathbf{X}, \mathbf{Y})$ therefore corresponds to the minimised RMSD (mRMSD) between the two molecular conformations. A further step is required to ensure our definition of the mRMSD is robust; if the orthonormal transformation matrix $\mathbf{U}$ does not constitute a right-handed system ($\det(\mathbf{U}) < 0$), then it corresponds to an improper rotation, i.e. a transformation that can be represented by some combination of a rotation about an axis and subsequent reflection in a plane. Thus, we require computation of the determinant of $\mathbf{U}$ to ensure we deal only with proper rotations.

The method for minimising (6.3.2) with respect to U can be undertaken in a number of ways. Historically, the orthonormality conditions of U have been added to (6.3.2) as Lagrange multipliers. Formal solutions were found by Kabsch[400, 401], after whom the algorithmic formalism for solution of (6.3.2) is named. We proceed in deriving the form of U in a more physically intuitive way, elaborated upon in a number of more recent sources [402, 403].

Conveniently, (6.3.2) can be made invariant with respect to global translations before seeking the form of $\mathbf{U}$. Defining the centroids of $\mathbf{X}$ and $\mathbf{Y}$ as

\[
\mathbf{x}_C = \frac{1}{N} \sum_{i=1}^{N} \mathbf{x}_i , \qquad \mathbf{y}_C = \frac{1}{N} \sum_{i=1}^{N} \mathbf{y}_i ,
\tag{6.3.3}
\]

we can shift each molecular conformation to its centroid-centred frame of reference by $\mathbf{x}_i \to \mathbf{x}_i - \mathbf{x}_C$ and $\mathbf{y}_i \to \mathbf{y}_i - \mathbf{y}_C$, a convention which we adopt herein. Allowing the centroids $\mathbf{x}_C$ and $\mathbf{y}_C$ to coincide can be shown to be equivalent to optimally aligning the two molecular conformations[404]. Thus, we are free to continue with the rotational alignment of the two conformations.

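Rewriting (6.3.2),

\[
N E(\mathbf{X}, \mathbf{Y}) = \sum_{i=1}^{N} |\mathbf{x}'_i - \mathbf{y}_i|^2 ,
\tag{6.3.4}
\]

where we have used the shorthand $\mathbf{x}'_i = \mathbf{U}\mathbf{x}_i$. Alternatively, we can use a matrix representation

\[
N E(\mathbf{X}, \mathbf{Y}) = \mathrm{Tr}\left[ (\mathbf{X}' - \mathbf{Y})^\top (\mathbf{X}' - \mathbf{Y}) \right] ,
\tag{6.3.5}
\]

where $\mathbf{X}' = \mathbf{U}\mathbf{X}$. Expansion of the right-hand side of (6.3.5) can be performed by invoking the identity

\[
\mathrm{Tr}\left[ (\mathbf{X}' - \mathbf{Y})^\top (\mathbf{X}' - \mathbf{Y}) \right] = \mathrm{Tr}(\mathbf{X}'^\top \mathbf{X}') + \mathrm{Tr}(\mathbf{Y}^\top \mathbf{Y}) - 2\, \mathrm{Tr}(\mathbf{Y}^\top \mathbf{X}') .
\]

Recognising that $\mathrm{Tr}(\mathbf{X}'^\top \mathbf{X}') = \sum_{i=1}^{N} |\mathbf{x}_i|^2$ (see footnote 6) and $\mathrm{Tr}(\mathbf{Y}^\top \mathbf{Y}) = \sum_{i=1}^{N} |\mathbf{y}_i|^2$,

\[
N E(\mathbf{X}, \mathbf{Y}) = \sum_{i=1}^{N} \left[ |\mathbf{x}_i|^2 + |\mathbf{y}_i|^2 \right] - 2\, \mathrm{Tr}(\mathbf{Y}^\top \mathbf{X}') .
\tag{6.3.6}
\]

Since the summation is not dependent upon $\mathbf{U}$, the minimisation of $E(\mathbf{X}, \mathbf{Y})$ is equivalent to maximising $\mathrm{Tr}(\mathbf{Y}^\top \mathbf{X}') = \mathrm{Tr}(\mathbf{Y}^\top \mathbf{U} \mathbf{X})$. Since the trace of a product of matrices is invariant under cyclic permutation of the factors, the quantity we wish to maximise can be rewritten as $\mathrm{Tr}(\mathbf{X}\mathbf{Y}^\top \mathbf{U})$.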

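We can write $\mathbf{X}\mathbf{Y}^\top$ as a singular value decomposition (SVD), i.e. $\mathbf{X}\mathbf{Y}^\top = \mathbf{V} \boldsymbol{\Sigma} \mathbf{W}^\top$, where $\mathbf{V}$, $\mathbf{W}$ are orthonormal matrices of left and right singular vectors, respectively, and $\boldsymbol{\Sigma}$ is a diagonal matrix of ordered singular values, i.e. $\sigma_{11} \geq \sigma_{22} \geq \dots$. Once again invoking the cyclic invariance of the trace of a product of matrices,

\[
\mathrm{Tr}(\mathbf{Y}^\top \mathbf{X}') = \mathrm{Tr}(\mathbf{V} \boldsymbol{\Sigma} \mathbf{W}^\top \mathbf{U}) = \mathrm{Tr}(\boldsymbol{\Sigma} \mathbf{W}^\top \mathbf{U} \mathbf{V}) = \sum_{i=1}^{3} \sigma_{ii}\, \mathbf{w}_i^\top \mathbf{U} \mathbf{v}_i ,
\tag{6.3.7}
\]

where $\mathbf{w}_i$ and $\mathbf{v}_i$ are the $i$th columns of $\mathbf{W}$ and $\mathbf{V}$, respectively. Introducing $\mathbf{T} = \mathbf{W}^\top \mathbf{U} \mathbf{V}$,

\[
\mathrm{Tr}(\mathbf{Y}^\top \mathbf{X}') = \sum_{i=1}^{3} \sigma_{ii} t_{ii} \leq \sum_{i=1}^{3} \sigma_{ii} ,
\tag{6.3.8}
\]

where the inequality results from the fact that $\mathbf{T}$ is orthonormal, since it is formed from a product of orthonormal matrices, so that its elements satisfy $|t_{ii}| \leq 1$. The orthonormality of $\mathbf{T}$ also guarantees $\det[\mathbf{T}] = \pm 1$. By analysis, we see that this quantity is maximised when $\mathbf{T}$ is equal to the identity matrix. Following from this, $\mathbf{T} = \mathbf{W}^\top \mathbf{U} \mathbf{V} = \mathbf{I}$, and since $\mathbf{W}$, $\mathbf{V}$ are orthogonal, $\mathbf{W} \mathbf{I} \mathbf{V}^\top = \mathbf{W}\mathbf{W}^\top \mathbf{U} \mathbf{V}\mathbf{V}^\top$, implying

\[
\mathbf{W}\mathbf{V}^\top = \mathbf{U} ,
\tag{6.3.9}
\]

where the columns of $\mathbf{W}$ and $\mathbf{V}$ are the right and left singular vectors, respectively, of the covariance matrix $\mathbf{X}\mathbf{Y}^\top$.

Footnote 6: These components do not need to be primed since the rotation matrix $\mathbf{U}$ does not modify the length of the $\mathbf{x}_i$.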

Our final consideration centres on whether $\mathbf{U}$ defines a proper or improper rotation, as we have previously alluded to. We can account for this by specifying that if $\det(\mathbf{X}\mathbf{Y}^\top) > 0$, then $d = 1.0$, otherwise $d = -1.0$, allowing us to give the final expression for $\mathbf{U}$,

\[
\mathbf{U} = \mathbf{W} \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & d \end{pmatrix} \mathbf{V}^\top .
\tag{6.3.10}
\]

We are now in a position to give the algorithmic workflow for ascertaining the Kabsch RMSD between two molecular geometries in Algorithm 7.

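1. Build the geometry matrices $\mathbf{X}$, $\mathbf{Y}$
2. Transform $\mathbf{X}$ and $\mathbf{Y}$ to their respective centroid-centred frames
3. Compute the covariance matrix, $\mathbf{X}\mathbf{Y}^\top$
4. Form the SVD of the covariance matrix, $\mathbf{X}\mathbf{Y}^\top = \mathbf{V} \boldsymbol{\Sigma} \mathbf{W}^\top$
5. Calculate $d = \mathrm{sign}(\det[\mathbf{X}\mathbf{Y}^\top])$
6. Compute the Kabsch rotation matrix, $\mathbf{U}$, from (6.3.10)

Algorithm 7: Kabsch RMSD Algorithm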
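The steps of the Kabsch algorithm map directly onto a few NumPy calls. The following is a minimal sketch under the conventions of this section (conformations stored as 3 × N arrays, with the rotation acting on X), and is not the implementation used in this work; the function name is hypothetical, and the final evaluation of the RMSD from (6.3.2) is added after the six listed steps.

```python
import numpy as np

def kabsch_rmsd(X, Y):
    """Minimised RMSD between two conformations given as 3 x N arrays.

    Returns the RMSD and the proper rotation U that superimposes X on Y.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    # Steps 1-2: move both conformations to their centroid-centred frames.
    X = X - X.mean(axis=1, keepdims=True)
    Y = Y - Y.mean(axis=1, keepdims=True)
    # Steps 3-4: covariance matrix and its SVD, X Y^T = V Sigma W^T.
    C = X @ Y.T
    V, _, Wt = np.linalg.svd(C)
    # Step 5: the sign of the determinant guards against improper rotations.
    d = 1.0 if np.linalg.det(C) > 0.0 else -1.0
    # Step 6: U = W diag(1, 1, d) V^T, as in (6.3.10).
    U = Wt.T @ np.diag([1.0, 1.0, d]) @ V.T
    # Evaluate (6.3.2) with the optimal rotation and take the square root.
    diff = U @ X - Y
    return float(np.sqrt((diff ** 2).sum() / X.shape[1])), U
```

Two conformations that differ only by a rigid rotation and translation return an RMSD of zero to within rounding, whereas a conformation and its mirror image do not, since the correction in step 5 forbids reflections.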

We propose a workflow for the implementation of the Kabsch RMSD for computing a set of maximally dissimilar molecular conformations, the “ensemble”, from a larger set. To this end, we propose the use of a greedy algorithm similar to that introduced in Section 5.2.2, the MaxMin algorithm. Given some initially large set of molecular conformations, we implement Algorithm 2 until an ensemble of the required size has been constructed. This ensemble then corresponds to the set of structures that are maximally dissimilar from the provided larger set.
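A greedy MaxMin selection driven by the Kabsch RMSD might be sketched as below. Algorithm 2 is not reproduced in this section, so the sketch assumes the common form of MaxMin: starting from a seed structure, the conformation whose smallest RMSD to the current ensemble is largest is added at each iteration. The seed choice (the first conformation) and all function names are assumptions.

```python
import numpy as np

def kabsch_rmsd(X, Y):
    """Minimised RMSD between two 3 x N conformations (see (6.3.10))."""
    X = X - X.mean(axis=1, keepdims=True)
    Y = Y - Y.mean(axis=1, keepdims=True)
    C = X @ Y.T
    V, _, Wt = np.linalg.svd(C)
    d = 1.0 if np.linalg.det(C) > 0.0 else -1.0
    U = Wt.T @ np.diag([1.0, 1.0, d]) @ V.T
    return float(np.sqrt(((U @ X - Y) ** 2).sum() / X.shape[1]))

def maxmin_select(conformations, size, seed=0):
    """Greedy MaxMin selection of `size` mutually dissimilar conformations.

    Returns the indices of the selected conformations, in selection order.
    """
    if not 0 < size <= len(conformations):
        raise ValueError("size must lie between 1 and the number of conformations")
    selected = [seed]
    # Distance from every conformation to its nearest ensemble member so far.
    nearest = np.array([kabsch_rmsd(c, conformations[seed]) for c in conformations])
    while len(selected) < size:
        # Add the conformation furthest from everything already selected.
        candidate = int(np.argmax(nearest))
        selected.append(candidate)
        to_new = np.array([kabsch_rmsd(c, conformations[candidate]) for c in conformations])
        nearest = np.minimum(nearest, to_new)
    return selected
```

Each iteration requires one RMSD evaluation per conformation, so selecting an ensemble of size m from n structures costs of the order of m × n Kabsch alignments rather than the n² needed for a full distance matrix.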

6.4 Computational Details

6.4.1 Molecular Dynamics

The arrangement of solvent molecules is typically accomplished by use of molecular dynamics. The system is solvated and the resultant trajectory is parsed into a number of snapshot configurations. These configurations are subsequently used as the conformational ensemble for spectral calculations. Recent work has selected snapshots entirely at random, with very rough (qualitative) estimates used to maximise the conformational diversity of the ensemble. The snapshots are then energetically minimised, and those snapshots with a high Boltzmann weight form an ensemble which is considered representative of the dominant system conformations. Since these conformations are dominant, it is presumed that they contribute most to the spectrum of the system.

We have conducted 100 ns molecular dynamics trajectories for His0[NτH], His0[NπH] and His1+, solvated in boxes comprising 469 water molecules, with a chloride ion added in the latter case to ensure the system remained electrically neutral. Throughout, we use the GROMACS molecular dynamics package[405] in conjunction with the OPLS-AA force field[406].

Initial energy minimisation of these systems was undertaken with the steepest descent method, until convergence of the total system energy was attained. Following energy minimisation, we have conducted two separate equilibration phases. The first was a 1 ns NVT equilibration, using the Berendsen thermostat to maintain a temperature of 300 K, whilst the second was a 1 ns NPT equilibration using the Parrinello-Rahman barostat to maintain a temperature and pressure of 300 K and 0.1 MPa, respectively. The particle mesh Ewald methodology was used to treat long-range electrostatics. A cutoff distance of 10 Å was chosen for Coulombic and van der Waals interactions. Both stages of equilibration were verified as being properly equilibrated by checking the convergence of the temperature for the NVT equilibration and the density of the system for the NPT equilibration. Both showed convergence after roughly 100 ps, and so our equilibration was assumed sufficient.

Our final step involved a 100 ns production molecular dynamics run in the NPT ensemble with temperature and pressure of 300 K and 0.1 MPa (standard conditions), respectively. To this end, we have used the Berendsen thermostat coupled with the isotropic Parrinello-Rahman barostat. The dynamical timestep used for our simulations was 0.5 fs, with a snapshot geometry output every 5 ps, resulting in 20,000 snapshot geometries. Coulombic and van der Waals interactions were treated as we have described in the preceding paragraph.
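The bookkeeping implied by these settings is easily checked. The short calculation below is purely illustrative; times are held as integers in attoseconds (an arbitrary choice made here so that the 0.5 fs timestep is represented exactly), and the variable names are hypothetical.

```python
# All times in attoseconds so that the 0.5 fs timestep is an exact integer.
FS, PS, NS = 10**3, 10**6, 10**9

timestep = 500              # 0.5 fs
production = 100 * NS       # 100 ns production run
output_interval = 5 * PS    # snapshot written every 5 ps

n_steps = production // timestep                   # integration steps in total
steps_per_snapshot = output_interval // timestep   # steps between snapshots
n_snapshots = production // output_interval        # snapshot geometries
```

The production run therefore comprises 2 × 10^8 integration steps, with a snapshot written every 10,000 steps, which recovers the 20,000 snapshot geometries quoted above.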

6.4.2 Geometric Filtering

Past work[407] on computing the Raman and ROA spectra of zwitterionic histidine has implemented a simplistic conformational selection tool. Six major in vacuo conformational preferences for histidine were proposed to arise from the χ1, χ2 torsional degrees of freedom: trans-plus (χ1 = 90°, χ2 = 180°); trans-minus (χ1 = −90°, χ2 = 180°); gauche-plus-plus (χ1 = 90°, χ2 = 60°); gauche-plus-minus (χ1 = −90°, χ2 = 60°); gauche-minus-plus (χ1 = 90°, χ2 = −60°); and gauche-minus-minus (χ1 = −90°, χ2 = −60°). These conformers were subsequently solvated in an ad hoc manner and energetically minimised. However, the solvation of in vacuo energetic minima does not yield a physically realistic conformational ensemble. Intramolecular hydrogen bonds stabilise in vacuo structures, which are not as preferable when the solute is solvated. Under explicit solvation, interactions with solvent are more favourable than the intramolecular hydrogen bonds[408]. Not only this, but zwitterionic amino acids are not stable in vacuo. This instability necessitates the introduction of non-physical constraints on energetic optimisation to maintain the zwitterionic form.

For each system, we have utilised the Kabsch selection methodology outlined in Section 6.3.3. For each trajectory comprising 20,000 snapshots, we have parsed the 20 snapshots whose solute geometries are the most mutually diverse (as outlined in Section 6.3.3). Solvent geometries were not included in the Kabsch filtering. We present a conformer analysis of the solute geometries obtained by the Kabsch filtering in Figure 6.4.

Figure 6.4: Sidechain dihedral angles (χ1, χ2) taken by conformers within the three histidine ensembles obtained by a Kabsch filtering of the MD trajectories; His1+ (blue circles), His0[NπH] (green diamonds) and His0[NτH] (red triangles). Grey hatched regions correspond to those conformers used in the work of Deplazes et al. (Axes: χ1 and χ2 in degrees, each spanning −180° to 180°.)

We have highlighted in grey those regions of the (χ1, χ2) plot in Figure 6.4 that correspond to the in vacuo minima used in [407], ±30°. Concerning the His1+ conformers, the (χ1 = 60°, χ2 = −60°) region is not sampled at all. Analysis shows this to result from an unfavourable ammonium-imidazole NπH contact. Concerning the His0 conformers, when χ2 = 90°, a favourable ammonium-imidazole Nπ interaction stabilises the His0[NτH] tautomer.

The poor sampling of the χ1 = −60° region for the His0[NτH] tautomer arises from an inability to form a stabilising carboxylate-imidazole NπH interaction. In contrast, the protonated Nπ in the His0[NπH] and His1+ forms permits this stabilising interaction.

Importantly, we see significant levels of sampling in the χ1 = ±180° regions for all forms of histidine. These conformations correspond to an extended imidazole group, where no intramolecular interactions involving the zwitterionic groups are formed. In these conformations, solvent molecules are able to favourably interact with both the peptide groups and the imidazole, in keeping with the findings of [408]. We have therefore demonstrated the inadequacy of using in vacuo minima for the conformational sampling of solvated systems.
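Whether a sampled conformer falls within one of the ±30° windows around the in vacuo minima can be tested with a short routine that respects the periodicity of the dihedral angles. The sketch below uses the six (χ1, χ2) pairs quoted from [407]; the function names, and the use of a square window about each minimum, are assumptions made for illustration.

```python
# The six in vacuo minima (chi1, chi2) in degrees, as listed in the text.
IN_VACUO_MINIMA = {
    "trans-plus": (90.0, 180.0),
    "trans-minus": (-90.0, 180.0),
    "gauche-plus-plus": (90.0, 60.0),
    "gauche-plus-minus": (-90.0, 60.0),
    "gauche-minus-plus": (90.0, -60.0),
    "gauche-minus-minus": (-90.0, -60.0),
}

def angular_difference(a, b):
    """Smallest signed difference a - b between two angles, in (-180, 180]."""
    d = (a - b) % 360.0
    return d - 360.0 if d > 180.0 else d

def in_vacuo_region(chi1, chi2, tolerance=30.0):
    """Name of the in vacuo minimum whose window contains (chi1, chi2), else None."""
    for name, (c1, c2) in IN_VACUO_MINIMA.items():
        if (abs(angular_difference(chi1, c1)) <= tolerance
                and abs(angular_difference(chi2, c2)) <= tolerance):
            return name
    return None
```

For example, a conformer at (χ1 = 100°, χ2 = −170°) lies within the trans-plus window once the wrap-around at ±180° is accounted for, whereas an extended conformer at χ1 = 180° lies outside all six windows.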

For our microsolvation studies, we have decided to use water clusters comprising 5, 10, 15 and 20 explicit water molecules. To this end, we selected the closest {5, 10, 15, 20} water molecules to the centre of mass of the zwitterionic histidine from each conformer. The Raman and ROA spectra for each of these were computed and compared to experiment, in the expectation that the increasing number of water molecules would yield spectra more similar to experiment.
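The selection of a microsolvation shell might be sketched as follows. Two assumptions are made that are not stated in the text: each water molecule is represented by the position of its oxygen atom, and periodic boundary conditions are ignored (i.e. the solute is taken to be centred in the box). The function name and array layouts are hypothetical.

```python
import numpy as np

def closest_waters(solute_xyz, solute_masses, water_oxygen_xyz, n_waters):
    """Indices of the n_waters water molecules nearest the solute centre of mass.

    solute_xyz       : (n_atoms, 3) Cartesian coordinates of the solute
    solute_masses    : (n_atoms,) atomic masses of the solute
    water_oxygen_xyz : (n_molecules, 3) oxygen position of each water molecule
    """
    xyz = np.asarray(solute_xyz, dtype=float)
    m = np.asarray(solute_masses, dtype=float)
    centre_of_mass = (m[:, None] * xyz).sum(axis=0) / m.sum()
    distances = np.linalg.norm(np.asarray(water_oxygen_xyz, dtype=float) - centre_of_mass, axis=1)
    # Sort by distance and keep the nearest n_waters molecules.
    return np.argsort(distances)[:n_waters]
```

Because the molecules are ranked once by distance, the clusters of 5, 10, 15 and 20 water molecules obtained in this way are nested: each smaller cluster is a subset of the next.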

6.4.3 Calculation of Spectra

All ab initio calculations are undertaken using the Gaussian09 software package, with a two-layer ONIOM[409] method. Histidine is modelled in the high layer, and solvent is modelled in the low layer. The high layer is treated with the B3LYP/6-31G(d) level of theory, and the low layer with the AMBER99SB force field including TIP3P parameters for the water molecules. Electronic embedding is used, so that the low layer electrostatics are incorporated into the quantum mechanical Hamiltonian. When appropriate, ab initio optimisations are performed with the Berny algorithm in the Cartesian basis, and we specify that no microiterations are used for electronic embedding.

For calculation of the molecular optical activity tensors, we invoke the analytical two-step formalism, or n + 1 algorithm[410], in which the harmonic frequency and ROA tensor calculations can be separated into differing levels of theory. For calculation of the normal coordinates and harmonic frequencies, we have used the B3LYP/6-31G(d) level of theory, while for the frequency-dependent ROA tensors[411], we have used the B3LYP/rDPS level of theory[412]. rDPS is a rarefied basis set that is based on the 3-21++G basis set with semi-diffuse p functions on all hydrogen atoms[413]. rDPS has been shown to provide a similar level of accuracy in the resulting spectra to much larger Dunning basis sets, such as aug-cc-pVDZ.

An excitation wavelength of 532 nm was used for calculation of the ROA tensors, in keeping with the conventional experimental setup. We have obtained scattered circular polarisation backscattered (SCP-180) ROA intensities. Individual conformer spectra are Boltzmann weighted, and those with a Boltzmann weight exceeding 0.5% are used to form the spectrum (using a Lorentzian bandwidth of 10 cm−1), for comparison with experimental spectra.
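The construction of a broadened spectrum from computed band positions and intensities can be sketched as follows. This is an illustration rather than the procedure used by the software in this work: the quoted bandwidth is assumed here to be the full width at half maximum, the line shape is taken to be area-normalised, and the function names are hypothetical.

```python
import numpy as np

def lorentzian_spectrum(nu, band_positions, band_intensities, fwhm=10.0):
    """Sum of Lorentzian bands evaluated on the wavenumber grid nu (cm^-1).

    Intensities may be negative, as is the case for ROA bands.
    Assumption: `fwhm` is the full width at half maximum of each band.
    """
    nu = np.asarray(nu, dtype=float)[:, None]
    nu0 = np.asarray(band_positions, dtype=float)[None, :]
    hwhm = 0.5 * fwhm
    # Area-normalised Lorentzian line shape centred on each band position.
    shape = (hwhm / np.pi) / ((nu - nu0) ** 2 + hwhm ** 2)
    return shape @ np.asarray(band_intensities, dtype=float)

def significant_conformers(weights, threshold=0.005):
    """Indices of conformers whose Boltzmann weight exceeds the threshold (0.5%)."""
    return np.flatnonzero(np.asarray(weights, dtype=float) > threshold)
```

Whether the retained weights should be renormalised after discarding the low-weight conformers is left open here; since the final spectra are normalised before comparison with experiment, the choice affects only the relative contribution of the discarded conformers.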

When referring to peptide Raman/ROA spectra, there are a number of spectral windows that are prominent enough to merit individual names. The Amide I band is situated at roughly 1650 cm−1, and primarily comprises carbonyl stretching modes. The Amide II band is situated at roughly 1550 cm−1 and comprises equal proportions of C-N-H bending and C-N stretching modes. Finally, the Amide III band is situated near 1300 cm−1, and is dominated by the same modes as for the Amide II band. We will refer to these regions by their respective names herein.

6.5 Conformer Optimisation

The optimisation of solvated systems is notoriously difficult owing to the absence of well-defined minima on the PES. If the PES is significantly undulant, conventional minimisation algorithms struggle, as low order spatial derivatives do not suffice in describing the local topology of such functions[291], and cannot direct the system to appropriate regions on the PES. One is frequently reduced to sequentially optimising the system at a number of increasingly complex levels of theory to obtain convergence of the maximum atomic forces and displacements. Whilst tedious, it is also no guarantee of convergence, as we now outline.

To circumvent having to compute (time-consuming) spatial derivatives of the energy at every step of the geometry optimisation, updating methods are typically invoked[414, 415]. Whilst one can in principle explicitly compute the spatial derivatives of the energy at each optimisation step, for large systems this is not feasible. As such, the Hessian is explicitly computed at the start of the geometry optimisation, and subsequently updated as a function of the displacement of the system from this initial geometry by use of low order derivative information. By approximating the Hessian at each step of the geometry optimisation, one can save a great deal of time. Indeed, such Hessian updating schemes have proven to be very accurate for a number of systems[416, 417].

However, consider now the case where the system begins in a conformation which is far from the energetic minimum. More steps are required for the geometry optimisation to reach an energetic minimum. Given that the potential energy surface is highly undulant, the Hessian updating scheme leads to increasingly poor approximations to the Hessian as the optimisation proceeds further from the initial geometry. As such, the geometry optimisation fails to converge. One is then required to adopt some intermediate approach, where the Hessian is explicitly calculated, updated for a number of optimisation steps, and then explicitly recomputed. The frequency with which one performs the explicit recomputation of the Hessian is then a matter of striking a balance between efficiency and accuracy.
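To make the notion of Hessian updating concrete, the sketch below implements a single BFGS update, one widely used updating formula. The text does not specify which update is employed by the optimiser, so the choice of BFGS here is an assumption made purely for illustration, as is the function name.

```python
import numpy as np

def bfgs_update(hessian, step, gradient_change):
    """One BFGS update of an approximate Hessian B.

    step            : s = x_new - x_old, the displacement of the geometry
    gradient_change : y = g_new - g_old, the change in the energy gradient
    The updated matrix satisfies the secant condition B_new @ s = y.
    """
    B = np.asarray(hessian, dtype=float)
    s = np.asarray(step, dtype=float)
    y = np.asarray(gradient_change, dtype=float)
    Bs = B @ s
    return B + np.outer(y, y) / (y @ s) - np.outer(Bs, Bs) / (s @ Bs)
```

Only gradients enter the update, which is why it is so much cheaper than recomputing second derivatives. Each update corrects the approximate Hessian along the most recent step alone, so on an undulant surface the approximation can drift from the true curvature as the geometry moves away from the point at which the Hessian was last computed explicitly.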

An alternative approach to optimising the entire system has been alluded to in Section 6.2.2. In the work of Zielinski[378] and Mutter[377], two levels of optimisation were considered: OptAll and OptSolute. The former optimises the entire system, and the latter only the solute in the field of the unoptimised solvent. Noting that, in these studies, the optimisations were performed on snapshots from MD trajectories, the system is initially far from the minimum energy conformation, rendering the optimisation all the more difficult. Hence, if one is able to “get away with” a simpler optimisation, the savings will be substantial.

The reasonable agreement of OptSolute with OptAll in the referenced works indicates that one does not necessarily require the entire system to be optimised to obtain informative spectra. Considering the solute-solvent system as a whole, the OptSolute methodology calculates the spectra of unoptimised systems; it is only a subsystem (the solute) that is optimised. This then raises the question of the value of energetic minima for these calculations, and whether strict geometry optimisation is an unnecessary burden.

The OptSolute model is appealing owing to the reduced computational complexity of the resultant optimisation. However, its use raises three pertinent questions: (1) How should the individual conformer spectra be Boltzmann weighted? (2) If the OptSolute scheme is valid, can we simplify the calculations further by adopting an even less rigorous optimisation criterion? (3) Are alternative approximate optimisation schemes viable? We deal with, and elaborate upon, each of these questions in turn in the following sections. For our test case, we take microsolvated zwitterionic histidine with five explicit water molecules. The conformers have been taken from the snapshots obtained with the methodology outlined in Section 6.4.

6.5.1 Boltzmann Weighting

If the solute is optimised in the field of the unoptimised solvent, we query whether it is proper to use the energy of the entire system for Boltzmann weighting the conformation. We make the (reasonable) assumption that the dominant spectral features arise from the solute, particularly in the higher wavenumber regions, where librational modes do not feature. Consider the effects of optimising the solute while freezing the solvent: the solute will adopt an energetically preferable conformation, while the solvent remains unoptimised. The Boltzmann weight of the system could be relatively low owing to a particularly unfavourable solvent conformation. However, if the solute is in a comparatively low energy conformation, the low Boltzmann weight is not a proper reflection of the contribution of the conformer’s spectrum to the overall spectrum.

In light of this argument, we propose two means for Boltzmann weighting the spectrum of a given conformation. The first takes the energy of the entire solute-solvent system (“system-weighting”), while the second takes the energy of just the optimised solute, without the solvent (“solute-weighting”). We evaluate the effects of the two weighting schemes on the conformers of microsolvated histidine. The Boltzmann weights of the conformers from the two schemes are compared with one another in Table 6.1.

Conformer Number    System Weight (%)    Solute Weight (%)
 1                  31.6                 14.7
 2                  22.3                 21.6
11                   6.7                 16.3
25                  20.1                 20.6
27                  18.0                 23.6
35                   1.3                  3.2

Table 6.1: Conformer Boltzmann weights exceeding 1% for the 35 microsolvated histidine systems using the system-weighting and solute-weighting schemes.
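As a minimal sketch of how the two weighting schemes differ only in the energies supplied, the following computes normalised Boltzmann weights and a weighted sum of conformer spectra. The temperature of 298.15 K, the use of Hartree energies, the function names and all numerical energies in the demonstration are assumptions made for illustration; they are not the values behind Table 6.1.

```python
import math

# Boltzmann constant in Hartree per kelvin.
K_B_HARTREE_PER_K = 3.166811563e-6

def boltzmann_weights(energies_hartree, temperature_k=298.15):
    """Normalised Boltzmann weights; energies are shifted by the minimum
    so that the exponentials stay well within floating-point range."""
    e_min = min(energies_hartree)
    beta = 1.0 / (K_B_HARTREE_PER_K * temperature_k)
    factors = [math.exp(-beta * (e - e_min)) for e in energies_hartree]
    total = sum(factors)
    return [f / total for f in factors]

def weighted_spectrum(spectra, weights):
    """Point-by-point weighted sum of conformer spectra on a common grid."""
    n_points = len(spectra[0])
    return [sum(w * s[k] for w, s in zip(weights, spectra))
            for k in range(n_points)]

if __name__ == "__main__":
    # Hypothetical energies for three conformers (Hartree).
    system_energies = [-930.1520, -930.1512, -930.1498]  # solute plus solvent
    solute_energies = [-548.3301, -548.3309, -548.3295]  # solute only
    print("system-weighting:", boltzmann_weights(system_energies))
    print("solute-weighting:", boltzmann_weights(solute_energies))
```

System-weighting and solute-weighting then correspond to calling `boltzmann_weights` with the total energies or with the solute-only energies, respectively, before passing the result to `weighted_spectrum`.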

From Table 6.1, we see that for both Boltzmann weighting schemes, the same six conformers dominate the resultant spectrum, albeit in different proportions. The first conformer is dominant with system-weighting (31.6%), while its Boltzmann weight is roughly halved with solute-weighting (14.7%). We can conclude that, in this case (and we have no reason to suspect that the result would not be general), both weighting schemes yield the same dominating conformers, but their contributions to the resulting spectrum differ rather significantly. To evaluate whether the resultant spectra are affected by the differing Boltzmann weights, we present the Raman and ROA spectra generated from the two schemes, in addition to the experimental spectrum, in Figures 6.5 and 6.6, respectively.

Figure 6.5: Raman spectra of the different Boltzmann weighting schemes. The top pane shows the Raman spectrum resulting from the system-weighting and the middle pane from the solute-weighting. The bottom pane is the experimental Raman spectrum of zwitterionic histidine.

[Figure 6.5 panels, top to bottom: System-Weighted, Solute-Weighted, Experimental. Axes: Raman intensity against wavenumber, 400-1800 cm-1.]

Figure 6.6: ROA spectra of the different Boltzmann weighting schemes. The top pane shows the ROA spectrum resulting from system-weighting and the middle pane from the solute-weighting. The bottom pane is the experimental ROA spectrum of zwitterionic histidine.

[Figure 6.6 panels, top to bottom: System-Weighted, Solute-Weighted, Experimental. Axes: ROA intensity against wavenumber, 400-1800 cm-1.]

From the Raman spectra in Figure 6.5, there appears to be no discernible difference between the two weighting schemes. A number of subtle exceptions exist, but certainly none which render one spectrum superior to the other when compared with the experimental spectrum. From the ROA spectra in Figure 6.6, we see that the two Boltzmann weighting schemes yield spectra that are again quite similar to the experimental spectrum. This similarity suggests that the Boltzmann weighting scheme does not strongly affect the spectra. However, there are a number of slight differences between the two. The 1350-1450 cm−1 region is the dominant region of the experimental spectrum, with a well-defined +ve/-ve/+ve profile. We see that the first positive region is recovered by both of our calculated spectra, but is arguably better modelled with the solute-weighting.

Indeed, it is interesting to note that this first peak is slightly blue-shifted relative to the experimental spectrum for the system-weighting. With the solute-weighting, this peak coincides quite well with the experimental spectrum. The subsequent -ve/+ve profile is present and well-recovered by both, but again the final positive peak is better recovered with the solute-weighting. The triplet of peaks in the 900-1100 cm−1 region of the experimental spectrum is also a well-defined feature, reproduced by both of our computed spectra. However, it is difficult to characterise which of the computed spectra models this region better.

From the results we have given, we conclude that the difference between the two Boltzmann weighting schemes is small. In the absence of a definitive result, we have chosen to use the solute-weighting scheme owing to the small improvements we have alluded to in the 1350-1450 cm−1 region. We also believe that considering only solute energies makes intuitive sense in the context of the optimisation schemes introduced in Section 6.5.3.

6.5.2 Effects of Level of Optimisation

We now assess whether we can circumvent the need to perform stringent geometry optimisations on our system. We have used three levels of optimisation for the comparison: no optimisation (conformers taken straight from the MD trajectory), and the “Loose” (maximum force ≤ 2.5 × 10^-3 au; maximum displacement ≤ 1.0 × 10^-2 Å) and “Regular” (maximum force ≤ 4.5 × 10^-4 au; maximum displacement ≤ 1.8 × 10^-3 Å) optimisation schemes available in the Gaussian package. The resultant spectra are given in Figures 6.7 and 6.8. A fourth level of optimisation, “Tight” (maximum force ≤ 1.5 × 10^-5 au; maximum displacement ≤ 6.0 × 10^-5 Å), was also attempted, but the system proved difficult to converge, and so we have not pursued this case.
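The thresholds quoted above can be collected into a small lookup, sketched below. Only the two maxima given in the text are tested; any additional criteria that an optimisation package may apply are ignored here, and the function names are hypothetical.

```python
# (maximum force / au, maximum displacement / Angstrom), as quoted in the text.
CRITERIA = {
    "loose": (2.5e-3, 1.0e-2),
    "regular": (4.5e-4, 1.8e-3),
    "tight": (1.5e-5, 6.0e-5),
}

def is_converged(max_force, max_displacement, level):
    """True if both maxima fall at or below the thresholds of `level`."""
    force_tol, disp_tol = CRITERIA[level]
    return max_force <= force_tol and max_displacement <= disp_tol

def strictest_level_met(max_force, max_displacement):
    """Return the most stringent level satisfied, or None if none is."""
    for level in ("tight", "regular", "loose"):
        if is_converged(max_force, max_displacement, level):
            return level
    return None

if __name__ == "__main__":
    # Hypothetical values for a partially relaxed snapshot.
    print(strictest_level_met(1.0e-3, 5.0e-3))
```

The ratio between the “Loose” and “Regular” thresholds is roughly a factor of 5.6 for both quantities, which gives a sense of how much additional relaxation the stricter setting demands.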

The spectra obtained from the completely unoptimised snapshots differ significantly from the other spectra formed from optimised geometries. This poor modelling is indicative of the fact that the MD snapshots are very far from energetic minima, and so are not representative of the conformational ensemble of the system. The ultimate arbiter is the comparison of the calculated spectrum with the experimental spectrum, and we see that agreement is poor between the unoptimised and experimental spectra. There appear to be no features of the experimental spectrum that are reproduced by the unoptimised spectrum.

Turning our attention to the optimised spectra, we see that the Raman spectra are very similar to one another, and reproduce a number of the features present in the experimental spectrum. The series of strong peaks in the 1200-1600 cm−1 region is largely accounted for in our computed spectra. However, we are unable to characterise one level of optimisation as superior to the other from the Raman spectra.

Figure 6.7: Raman spectra of zwitterionic histidine computed from a number of levels of geometrical optimisation. The top pane contains the Raman spectrum computed directly from unoptimised MD snapshots, while the middle two contain Raman spectra computed from conformers which have been optimised with the “Loose” and “Regular” convergence criteria. The bottom panel contains the experimental Raman spectrum of histidine.

[Figure 6.7 panels, top to bottom: Unoptimised, Loose Optimisation, Regular Optimisation, Experimental. Axes: Raman intensity against wavenumber, 400-1800 cm-1.]

Figure 6.8: ROA spectra of zwitterionic histidine computed from a number of levels of geometrical optimisation. The top pane contains the ROA spectrum computed directly from unoptimised MD snapshots, while the middle two contain ROA spectra computed from conformers which have been optimised with the “Loose” and “Regular” convergence criteria. The bottom panel contains the experimental ROA spectrum of histidine.

[Figure 6.8 panels, top to bottom: Unoptimised, Loose Optimisation, Regular Optimisation, Experimental. Axes: ROA intensity against wavenumber, 400-1800 cm-1.]

The ROA spectra do, however, differ to some degree. The spectral features in the 800-1000 cm−1 window appear to be poorly characterised by the regular optimisation spectrum, but are relatively strong in the loose optimisation spectrum. It could be that the amplitude of the peak at roughly 1100 cm−1 dwarfs these features in the regular optimisation, since it appears to be stronger than in the loose optimisation. We note that this peak is not particularly strong in the experimental spectrum. Assessing the 1350-1450 cm−1 region, both the loose and regular optimisation schemes reproduce the +ve/-ve/+ve signal, with there being no notable difference in their qualities relative to the experimental spectrum.

Based on these findings, we have deemed the loose optimisation to be sufficient in reproducing the spectral features of the experimental spectrum. The regular optimisation offers little improvement, if any, in the calculated spectra relative to the loose optimisation. Considering that the convergence level is roughly an order of magnitude more stringent for the regular optimisation, it seems to be an unnecessary additional effort to optimise the systems so thoroughly. Indeed, the loose optimisation requires roughly half the amount of time to compute relative to the regular optimisation for this system. Ensuring the system is at least somewhat close to an energetic minimum appears to be sufficient for computing spectra. However, we reiterate that if the system is far from an energetic minimum, as we understand the unoptimised MD snapshots to be, one obtains extremely poor computed spectra.

6.5.3 Alternative Optimisation Schemes

Given that the optimisation of a subsystem leads to good quality spectra, the final point we wish to address is whether alternative subsystem optimisation schemes yield equally informative spectra. The OptAll scheme has been found to produce the highest quality spectra, and so this is the case we use as a benchmark. However, we can also formulate several alternative subsystem optimisation schemes:

(a) OptSolute The solute is optimised in the field of the unoptimised solvent. We have already alluded to this approach and its adoption in previous work, yielding spectra that are comparable to OptAll.

(b) OptSolvent The converse approach to OptSolute, where the solvent is optimised in the field of the unoptimised solute. This approach is potentially all the more appealing, since the optimisation takes place in the low layer of the QM/MM, and is therefore faster.

(c) OptSolvent→OptSolute The OptSolvent methodology is invoked, followed by OptSolute. By optimising the layers separately, we hope that we can recreate the optimisation resulting from OptAll.

(d) OptSolute→OptSolvent The OptSolute methodology is invoked, followed by OptSolvent. We do not imagine this to differ from the previous approach, but it is included for the sake of completeness.

The first two methods, (a) and (b), are concerned with obtaining crude approximations for a large decrease in computational cost. We expect these methods to yield spectra of far poorer quality than OptAll, but to be far quicker to compute. In contrast, the last two methods, (c) and (d), look to replicate the quality of spectrum obtained by the OptAll scheme, while saving some level of computational time in the process. In the following, we will refer to each optimisation scheme by the letter it has been listed with in the above enumeration. When referring to OptAll, we will use (e).
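The five schemes can be illustrated on a toy problem. The sketch below assumes a hypothetical coupled quadratic energy in which a single coordinate `x` stands in for the solute and a single coordinate `y` for the solvent, minimised by plain gradient descent with one subsystem optionally frozen. It is not a QM/MM calculation, and its energies say nothing about the quality of the resulting spectra.

```python
def energy(x, y):
    # Hypothetical model: solute term, solvent term and a coupling term.
    return (x - 1.0) ** 2 + (y + 2.0) ** 2 + 0.5 * x * y

def grad(x, y):
    return 2.0 * (x - 1.0) + 0.5 * y, 2.0 * (y + 2.0) + 0.5 * x

def relax(x, y, move_solute, move_solvent, step=0.1, n_steps=500):
    """Gradient descent in which a frozen subsystem keeps its coordinate."""
    for _ in range(n_steps):
        gx, gy = grad(x, y)
        if move_solute:
            x -= step * gx
        if move_solvent:
            y -= step * gy
    return x, y

def run_schemes(x0, y0):
    """Final energies of schemes (a)-(e) from a common starting geometry."""
    a = relax(x0, y0, True, False)        # (a) OptSolute
    b = relax(x0, y0, False, True)        # (b) OptSolvent
    c = relax(*b, True, False)            # (c) OptSolvent -> OptSolute
    d = relax(*a, False, True)            # (d) OptSolute -> OptSolvent
    e = relax(x0, y0, True, True)         # (e) OptAll
    return {label: energy(*geom)
            for label, geom in zip("abcde", (a, b, c, d, e))}

if __name__ == "__main__":
    for label, value in run_schemes(3.0, 3.0).items():
        print(label, round(value, 4))
```

In this model the two-step schemes (c) and (d) always end at an energy no higher than the single-step scheme they start from, and (e) provides the lower bound, which mirrors the motivation given in the text for attempting to approximate OptAll by sequential subsystem optimisations.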

From the Raman spectra of Figure 6.9, we immediately see that (b) performs particularly poorly. All three Amide I-III regions are poorly represented, and the intensity of the low wavenumber regions is not in agreement with the experimental spectrum. However, the other four optimisation schemes yield spectra that are in good agreement with the experimental spectrum. The Amide I-III regions are well-recovered, while the low wavenumber regions are modelled equally well. If we are to be fastidious, we may question the Amide I regions of (a) and (d), where the signal does not possess the same intensity as that in the experimental spectrum. The best agreement with the experimental spectrum is arguably given by (c), which appears to recover the vast majority of spectral features present in (e).

Turning our attention to the ROA spectra of Figure 6.10, we notice that the poor performance of (b) is continued, and virtually no experimental features are recovered. We also draw attention to the poor performance of (d) in the low wavenumber regions. In turn, this poor modelling of the low wavenumber regions obscures the finer details of the high wavenumber regions, and the spectrum as a whole deteriorates. Similarly, (a) appears to be a poor approximation to the experimental spectrum; the coarse details of the spectrum appear to be present, but the high wavenumber regions are poorly modelled. In contrast, (c) is in excellent agreement with both the spectra obtained through experiment and (e). With the exception of a few features in the low wavenumber regions, (c) and (e) are almost identical.

Offering some explanation for the results we have obtained, the optimisation of the solute appears to be the key factor in guaranteeing the quality of computed spectra.

Figure 6.9: Raman spectra resulting from the optimisation methodologies we have outlined. Enumeration corresponds to the labels we have used in the main text.

[Figure 6.9 panels, top to bottom: OptSolute, OptSolvent, OptSolvent→OptSolute, OptSolute→OptSolvent, OptAll, Experimental. Axes: Raman intensity against wavenumber, 400-1800 cm-1.]

Figure 6.10: ROA spectra resulting from the optimisation methodologies we have outlined. Enumeration corresponds to the labels we have used in the main text.

[Figure 6.10 panels, top to bottom: OptSolute, OptSolvent, OptSolvent→OptSolute, OptSolute→OptSolvent, OptAll, Experimental. Axes: ROA intensity against wavenumber, 400-1800 cm-1.]

The poor results from (b) and (d) presumably derive from the optimisation of the solvent being the final optimisation prior to spectrum calculation, and so the solute is in an unoptimised state. To recreate the finer details of the spectrum, it also appears as if the solvent requires some level of optimisation, particularly with the ROA measurements. This would explain why (c) significantly outperforms (a). The Raman spectra seem to be robust with regard to the level of solvent optimisation.

In conclusion, we feel that (c) and (e) offer the best recovery of experimental spectral features. However, we have found that the computational cost associated with (c) is more than fourfold lower on average than that of (e). Relative to (a), (c) is roughly 50% more expensive computationally. Therefore, it seems as if the OptSolvent → OptSolute methodology strikes an ideal balance between accuracy and compute time. We proceed with this methodology through the remainder of our work.

6.6 Microsolvation

We define microsolvation to be the explicit solvation of a system using a small number of water molecules, typically not extending beyond the second solvation shell of the solute. The major question we wish to answer is whether computed spectra from microsolvated systems are able to model experimental spectra, without having to resort to performing calculations on large explicitly solvated systems. If we can answer this query positively, then it facilitates the computation of solvated Raman and ROA spectra^7.

^7 In fact, it is reasonable to generalise these results to the majority of vibrational spectroscopies.

To summarise those methodological choices we have made from the results in Section 6.5, we have used the “Loose” optimisation setting to optimise all geometries, subject to the OptSolvent → OptSolute optimisation methodology. All spectra are reconstituted based on the Boltzmann weights of the solute molecule, independent of the energetic contribution of the solvent.

6.6.1 Neutral Histidine

For the Raman spectra in Figure 6.11, we find that the Amide I band at around 1600 cm-1 is well recovered by each of the microsolvated systems. Since the Amide I band is largely indicative of peptide C=O stretching modes, it is initially surprising that its modelling appears to be independent of the level of microsolvation, given that solvent interactions presumably damp the stretching mode. However, upon closer inspection of the microsolvated conformers used for the spectra, we find that the peptide C=O is solvated in each conformer, rendering its modelling independent of the degree of microsolvation. Similarly, we recover the Amide II triplet at ∼1450-1550 cm-1, albeit blue-shifted relative to the experimental spectrum. The triplet signature appears to be most pronounced with a solvation shell comprising 15 water molecules, where the central peak is dominant. This result is to be expected since a solvation shell of 15 water molecules appears to coincide with the number of water molecules comprising the first solvation shell for zwitterionic histidine. Since the Amide II region is representative of peptide N-H bending and C-H stretching modes, we expect this region to be highly dependent on solvent interactions. It is, nevertheless, surprising that the quality of our computed spectra deteriorates with 20 water molecules.

The Amide III region is slightly less well-modelled by our microsolvated systems. In the experimental spectrum, we observe a well-defined quadruplet of peaks, spanning 1200-1350 cm-1. This region is dominated by C-N stretching and N-H bending modes, and we would expect this region to be as solvation-dependent as the other two regions we have discussed. We note that we do not manage to definitively recover the clear quadruplet seen in the experimental spectrum, although the profile does seem to become more enhanced as we increase the degree of microsolvation. Indeed, when 20 water molecules are included, we are able to discern four clear peaks in the Amide III region.

Figure 6.11: Raman spectra of microsolvated histidines. From the top pane to the bottom pane, we show histidine microsolvated by 5, 10, 15 and 20 explicit water molecules. The final pane contains the experimental histidine Raman spectrum.

[Figure 6.11 panels, top to bottom: 5 Water Molecules, 10 Water Molecules, 15 Water Molecules, 20 Water Molecules, Experimental. Axes: Raman intensity against wavenumber, 400-1800 cm-1.]

Regarding the impact of the tautomeric forms of histidine, Ashikawa and Itoh[418] have found that the breathing motions for His⁰[NπH] and His⁰[NτH] can be related to Raman peaks at 1304/1260 cm-1, respectively. In the same work, two additional marker regions characterising imidazole stretching modes between the tautomeric forms of histidine were proposed: 1568/1585 cm-1 and 1090/1105 cm-1. Later work by Toyama et al.[419] has suggested an additional marker region at 1320/1354 cm-1.

Unfortunately, the majority of these marker regions inhabit the strong amide regions of the Raman spectrum, making them difficult to discern. However, the 1090/1105 cm-1 region occupies a low intensity region of the Raman spectrum, and appears to be characterised in both our experimental and computed spectra, at each microsolvation level excluding the spectrum with 5 waters.

Figure 6.12: ROA spectra of microsolvated histidines. From the top pane to the bottom pane, we show histidine microsolvated by 5, 10, 15 and 20 explicit water molecules. The final pane contains the experimental histidine ROA spectrum.

[Figure 6.12 panels, top to bottom: 5 Water Molecules, 10 Water Molecules, 15 Water Molecules, 20 Water Molecules, Experimental. Axes: ROA intensity against wavenumber, 400-1800 cm-1.]

The experimental ROA spectrum, depicted in Figure 6.12, is not quite as well recovered by the microsolvated spectra as the experimental Raman spectrum. This is not to say that the ROA spectra are of poor quality; they simply suffer by comparison with the high quality of the Raman spectra. The dominant feature across both experimental and calculated spectra is the +ve/-ve/+ve signal at 1300-1500 cm-1, which is well-recovered at all levels of microsolvation. However, it is worth pointing out that the calculated spectra appear to be blue-shifted by ∼50 cm-1 relative to the experimental spectrum. This blue-shifting is somewhat surprising. Note that we have not undertaken abscissa-scaling by 0.96, as prescribed by Radom[383]. Doing so would blue-shift the spectrum further, and so would result in the deterioration of our calculated spectra.

We note that as the level of microsolvation increases, the peak at 1050 cm-1 is amplified. A number of peaks exist in this region of the experimental spectrum, but none are as prominent as that in the microsolvated spectrum. Assessing the normal modes of motion around this region, we find that the majority are dominated by sidechain dihedral torsional degrees of freedom. This observation then suggests that the sidechain torsional degrees of freedom are more flexible in the microsolvated systems than in the experiment. Indeed, it is predominantly the zwitterionic groups and imidazole ring that form interactions with the solvent molecules in the microsolvated systems, and so it is to be expected that the alkyl sidechain be poorly solvated, and so not well recovered by the microsolvated spectra.

6.6.2 Protonated Histidine

The literature regarding the calculation of Raman and ROA spectra of formally charged species is sparse. Indeed, the ab initio modelling of formally charged species is in itself difficult, since the basis sets require both diffuse functions and some description of polarisation, the former being of particular importance for anions. To assess the quality of Raman and ROA spectra of a formally charged species, we have selected cationic histidine. The fact that no tautomers exist for this species makes its modelling simpler.

A formally charged species is presumed to undergo stronger interactions with solvent than the neutral species, and so we hypothesise that the degree of microsolvation will significantly influence the quality of the computed spectrum. Since the formal charge results from the protonation of the imidazole ring, we anticipate that the higher wavenumber regions will be poorly reproduced at the low levels of microsolvation since the explicit water molecules tend to primarily interact with the zwitterionic groups.

The computed Raman spectra of Figure 6.13 are in relatively good agreement with the experimental Raman spectrum. The strong peaks at ∼1200 cm-1 and ∼1300 cm-1 become increasingly prominent as the degree of microsolvation increases. The peak at ∼1500 cm-1 is well-recovered by each of the microsolvated spectra. The resolution of the doublet of peaks at roughly 800 cm-1 is similarly improved as the degree of microsolvation increases. Interestingly, the general agreement with the experimental spectrum deteriorates when 10 explicit water molecules are used for microsolvation. However, those spectra including 5, 15 and 20 explicit water molecules are very similar.

We offer a potential reason for the deterioration in spectrum quality when 10 explicit water molecules are included. When 5 explicit water molecules are included, the conformers are largely solvated around the backbone amide and carboxyl groups of the zwitterionic histidine. When 15 and 20 explicit water molecules are included, the conformers are completely solvated, i.e. both the backbone and sidechain groups of the zwitterionic histidine. However, when 10 explicit water molecules are included, only a couple of water molecules solvate the imidazole ring of the zwitterionic histidine. As such, we suggest that the partial solvation of the imidazole results in a biasing of certain vibrational frequencies originating from the imidazole-solvent interactions. The spectrum as a whole deteriorates since these imidazole-solvent frequencies then dominate the resultant spectrum. In summary, it is tempting to think of partial solvation as detrimental to the correct prediction of the vibrational spectra.

Figure 6.13: Raman spectra of microsolvated cationic histidine. From the top pane to the bottom pane, we show cationic histidine microsolvated by 5, 10, 15 and 20 explicit water molecules. The final pane contains the experimental histidine Raman spectrum.

[Figure 6.13 panels, top to bottom: 5 Water Molecules, 10 Water Molecules, 15 Water Molecules, 20 Water Molecules, Experimental. Axes: intensity against wavenumber, 400-1800 cm-1.]

Figure 6.14: ROA spectra of microsolvated cationic histidine. From the top pane to the bottom pane, we show cationic histidine microsolvated by 5, 10, 15 and 20 explicit water molecules. The final pane contains the experimental histidine ROA spectrum.

[Figure 6.14 panels, top to bottom: 5 Water Molecules, 10 Water Molecules, 15 Water Molecules, 20 Water Molecules, Experimental. Axes: ROA intensity against wavenumber, 400-1800 cm-1.]

The computed ROA spectra in Figure 6.14 are, however, not quite as successful in reproducing the major features of the experimental spectrum. The strong negative band at ∼1400 cm-1 is recovered in each of the computed spectra, and the positive doublet in the 1300-1350 cm-1 region appears to become clearer as the level of microsolvation increases.

From previous, unpublished work on anionic adenosine triphosphate, we have found that the calculation of ROA spectra for formally charged species is rather difficult. One speculative reason for the poorer quality of the ROA spectra relative to the Raman spectra is the way one computes the intensity of spectral bands for the two spectroscopies. For Raman spectra, the intensity is given by the sum of the intensities of the right- and left-circularly polarised scattered light. For ROA spectra, the intensity is given by the difference in the intensities of the right- and left-circularly polarised scattered light. Therefore, ROA band intensities are more prone to error than the Raman band intensities. In other words, the ROA band intensities are smaller in magnitude than the Raman band intensities, and so the relative errors are amplified. It is this sensitivity to error that makes ROA a “gold-standard” spectroscopic technique, whereas Raman is somewhat more forgiving.

6.7 Conclusion

We have investigated and presented the results for a number of novel methodologies that can be used in the calculation of ab initio Raman and ROA spectra of solvated systems. We are deliberately vague in our use of the term “solvated systems”, since we see no reason why our methodology cannot be applied to systems outside of the domain of peptides, such as DNA bases[420, 421]. A novel conformational sampling methodology allows for the unambiguous extraction of a set of mutually diverse conformers from an MD trajectory of arbitrary length. We have shown that conformers do not require strict optimisation to obtain high quality spectra, and that a two-step subsystem optimisation (i.e. optimising the solvent while keeping the solute frozen, and subsequently optimising the solute while keeping the solvent frozen) yields spectra of the same quality as entire system optimisations. The microsolvation of both formally charged and neutral zwitterionic histidine species can yield calculated spectra that converge towards the experimental spectra.

Chapter 7

Conclusion and Future Work

7.1 General Conclusions

We have presented a number of developments aimed at improving the construction and implementation of the QCTFF. The work in this thesis has focused primarily on conformational sampling, and how it can be used to construct optimal training sets. We have also demonstrated the applicability of the QCTFF to carbohydrates for the first time. Finally, we have undertaken a rigorous study on computing Raman and ROA spectra for charged, microsolvated species, as well as introduced various novel techniques for improving both the computational speed and accuracy of the ROA calculations.

In Chapter 3, we presented a conformational sampling methodology that is both computationally tractable, and in principle, more representative of the sampling one may obtain from the ab initio PES than one constructed by a classical force field. The tyche methodology is ideal for the construction of conformers with which a machine learning model can be constructed. It naturally implements a form of importance sampling by transitioning between seeding geometries based on Boltzmann weights. It also provides both large-scale conformational sampling and finer sampling of the higher frequency degrees of freedom. These properties make it an ideal conformational sampling tool for the construction of machine learning models.

It may perhaps be feasible to approximately update the Hessian of the system with respect to a distortion. Hessian updating methods are frequently invoked during ab initio geometry optimisations[291], and for small displacements are relatively accurate. By use of these Hessian-updating methodologies, one could re-evaluate the normal modes of the system once it has been displaced from the seeding geometry, and subsequently vibrate along these new normal modes. However, it is anticipated that such a methodology would significantly increase the computational power required for sampling.

In Chapter 4, we demonstrated that the atomic multipole moments of a set of carbohydrates are amenable to the machine learning technique kriging. Whilst this has been done in the past for a variety of chemical species including naturally occurring amino acids, this is the first foray into the field of glycobiology.

Kriging is able to capture the conformational dependence of the multipole moments and make predictions, such that the error in the electrostatic energy relative to that derived from ab initio data is encouraging, given that the popular aim is to obtain errors below 4 kJ mol⁻¹. Indeed, the presented methodology is immediately extensible to any term arising in an energetic decomposition of a system. If some quantity is conformationally dependent, then the dependence can be modelled by kriging.

As such, an entire force field can be parameterised by the current methodology, reproducing ab initio quantities for use in classical MD. This route is preferable to the computationally intensive approach of ab initio MD.

Chapter 5 presented a number of methodological developments for the selection of an optimal training set for kriging. The greedy subset selection algorithms have been reformulated to use a novel selection function. This novel selection function results in improved accuracy of the kriging model, relative to one constructed by a random subset selection. However, work is still required to optimise this methodology and yield “ideal” training sets.

A novel subset selection algorithm, the Iterative Voronoi algorithm, has been presented. However, this methodology is only viable for low-dimensional cases, and so cannot be implemented for our purposes. In spite of this, if it were possible to reduce the dimensionality of the systems in some way, it may be feasible to implement the Iterative Voronoi algorithm in the future and establish whether it is a successful subset selection methodology for larger systems.

We have presented a novel means for defining the Domain of Applicability (DoA) of a machine learning model. By approximating the “restricted affine hull” from a set of points by use of a greedy algorithm to construct the largest d-polytope on the points, it has been shown that one can approximately determine whether a point lies within the training set of a machine learning model in low-dimensional spaces. Further work is, however, required to make the methodology more successful in higher dimensional spaces. The approximation of the DoA of a kriging model is of utility in our work since it allows one to ascertain whether a kriging model is predictive in the relevant regions of conformational space.

Finally, in Chapter 6, we investigated and presented the results for a number of novel methodologies that can be used in the calculation of ab initio Raman and ROA spectra of solvated systems. We are deliberately vague in our use of the term “solvated systems”, since we see no reason why our methodology cannot be applied to systems outside of the domain of peptides, such as DNA bases[420, 421]. A novel conformational sampling methodology allows for the unambiguous extraction of a set of mutually diverse conformers from an MD trajectory of arbitrary length. We have shown that conformers do not require strict optimisation to obtain high quality spectra, and that a two-step subsystem optimisation (i.e. optimising the solvent while keeping the solute frozen, and subsequently optimising the solute while keeping the solvent frozen) yields spectra of the same quality as entire system optimisations. The microsolvation of both formally charged and neutral zwitterionic histidine species can yield calculated spectra that converge towards the experimental spectra.

7.2 Future Work

It is anticipated that the QCTFF will be ready for commercial release in the near future. Indeed, soon-to-be-published data show that geometry optimisation of a number of non-trivial molecules with the QCTFF results in energetic minima that are within 1 kJ mol⁻¹ of the ab initio minima. A local version of DL_POLY 4.0 has been modified to read and use kriging models over the course of an MD simulation, so there is no obstacle to the prediction of dynamical quantities and other quantities associated with dynamical trajectories.

In spite of the success of geometry optimisation, a great deal of work remains in using the kriging models as efficiently as possible. We have discussed the memory requirements associated with a force field comprising kriging models in some detail in Appendix B. However, actually invoking the kriging models over the course of an MD simulation is computationally intensive relative to a conventional classical force field, where one can access and manipulate a database of force field parameters in high-speed memory. Therefore, some strategy must be devised to handle both the transfer of kriging models to high-speed memory and the subsequent predictions made from them. Considering the variety of high-performance computing (HPC) architectures, a number of solutions are available for using the kriging models efficiently. For instance, one could take advantage of the massively parallel architectures of HPC machines[422]. Future work could also look towards employing more advanced hardware, such as graphics processing units (GPUs)[423], field-programmable gate arrays (FPGAs)[424] or application-specific integrated circuits (ASICs)[425], which can vastly outperform standard HPC systems by use of, for instance, enhanced pipelining capabilities.

It is worth noting that a number of varieties of kriging exist. The version that we have elaborated upon in Section 2.4.2 is referred to as “Ordinary Kriging”[426], and is the simplest implementation available. For instance, “Noisy Kriging”[427] is used when there is a known level of uncertainty in the data points used to construct the training set. The kriging model subsequently accounts for this uncertainty by not being restricted to pass directly through the training points. Other alternatives include “Nugget Kriging”[428] and “Universal Kriging”[429].
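
The difference between an interpolating and a non-interpolating kriging model can be sketched in one dimension. The example below assumes a Gaussian correlation function with a single hyperparameter `theta` and relaxes interpolation by adding a constant `noise_variance` to the diagonal of the correlation matrix. This is one common regularisation and is not necessarily the formulation of [427] or [428]; the function names and parameter values are hypothetical.

```python
import math


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def corr(x1, x2, theta):
    """Gaussian correlation between two one-dimensional inputs."""
    return math.exp(-theta * (x1 - x2) ** 2)


def krige(xs, ys, x_new, theta=1.0, noise_variance=0.0):
    """Predict at x_new with a constant-mean kriging model.

    noise_variance = 0 gives an interpolating model; a positive value is
    added to the diagonal so the model need not pass through the data.
    """
    n = len(xs)
    r_mat = [
        [corr(xs[i], xs[j], theta) + (noise_variance if i == j else 0.0)
         for j in range(n)]
        for i in range(n)
    ]
    # Generalised least-squares estimate of the constant mean.
    mu = sum(solve(r_mat, ys)) / sum(solve(r_mat, [1.0] * n))
    weights = solve(r_mat, [y - mu for y in ys])
    r_vec = [corr(x_new, x, theta) for x in xs]
    return mu + sum(r * w for r, w in zip(r_vec, weights))
```

Evaluating `krige` at a training input returns the training value when `noise_variance` is zero, whereas a positive `noise_variance` pulls the prediction away from that value, which is the behaviour described above for noisy data.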

A number of conceptual problems remain and require resolution. The first of these has been alluded to throughout this work; we require some means of defining an atom type so that the kriging models can be used for more than a single atom in a system. Work has previously been undertaken within our group to find some numerical criterion for assessing the transferability of a kriging model[430]. It was thought that if the kriging hyperparameters were similar between two kriging models, the models would be transferable. However, this line of analysis has proven complex, and is the subject of ongoing research.

A particularly desirable feature of ab initio MD is that the associated dynamics are reactive. Since the electrons are explicitly modelled, chemical bonds can be broken and formed, and the molecular graph can be altered over the course of the simulation. Whilst there exists a small literature on reactive classical force fields[431], if we are to make the QCTFF a valid competitor to the ab initio MD schemes, some level of reactivity must be permitted. One could perhaps do so by constructing kriging models for each of the distinct molecular graphs of a system along the reaction coordinate. However, when transition states are unknown, constructing kriging models for these species is not feasible. As an alternative, it may be possible to construct kriging models for only the reactants and products of a particular reaction scheme, and to interpolate smoothly “between” the kriging models over the course of the reaction to obtain approximations to atomic quantities during the intermediary stages of the reaction. We have not investigated such a methodology, but some work is certainly warranted.
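
Purely as an illustration of what interpolating “between” two models might mean, the sketch below blends a reactant prediction and a product prediction with a smooth switching function of a reaction progress variable in [0, 1]. This scheme has not been investigated in this work; the cubic switching function, the progress variable and all names are assumptions made for the sake of the example.

```python
def smoothstep(progress):
    """Cubic switching function: 0 at progress <= 0, 1 at progress >= 1,
    with zero slope at both ends so the blend joins each model smoothly."""
    s = min(1.0, max(0.0, progress))
    return s * s * (3.0 - 2.0 * s)


def blend_predictions(reactant_value, product_value, progress):
    """Mix two model predictions of the same atomic quantity.

    reactant_value / product_value : predictions from the (hypothetical)
    reactant and product kriging models at the current geometry.
    progress : assumed reaction progress variable between 0 and 1.
    """
    w = smoothstep(progress)
    return (1.0 - w) * reactant_value + w * product_value
```

The blend reduces to the reactant model at `progress = 0` and to the product model at `progress = 1`; how well it approximates atomic quantities in between would have to be established against ab initio data.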

In keeping with this, we anticipate that the modelling of open shell and excited state systems could perhaps be a stumbling block to the QCTFF. Popular IQA software suites do not offer compatibility with post-Hartree-Fock or multireference wavefunctions. Obtaining valid energetic quantities for atoms in these states could be problematic, but this is speculation on the author’s part. The IQA methodology is being extended so that a number of other wavefunction-based methods can be handled[432], so perhaps no problems will be encountered when generalising further. It is also worth noting that IQA decompositions can be undertaken for the likes of transition metals and heavy elements[433, 434], but the effect of including pseudopotentials in wavefunction construction on the resultant IQA calculations is unclear, and is not particularly well documented.

There is nothing that restricts the training of the QCTFF to biomolecular systems. So long as the software exists for treating the molecular system with QCT and IQA, then kriging models can be constructed and subsequently used in MD. Therefore, there are technically no barriers to the construction of a generally applicable force field. It is hoped that the force field will prove to be of a great deal of use to the computational chemistry community in the near future.

Bibliography

[1] Paul AM Dirac. Quantum mechanics of many-electron systems. Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, 123(792):714–733, 1929.

[2] Walter Kohn. Nobel lecture: Electronic structure of matter—wave functions and density functionals. Reviews of Modern Physics, 71(5):1253, 1999.

[3] Robert B Laughlin and David Pines. The theory of everything. Proceedings of the National Academy of Sciences of the United States of America, pages 28–31, 2000.

[4] Garnet Kin-Lic Chan and Sandeep Sharma. Solving problems with strong correlation using the density matrix renormalization group (DMRG). In Solving the Schrödinger Equation: Has Everything Been Tried?, pages 43–60. Imperial College Press, 2011.

[5] Richard FW Bader. Atoms in molecules. Wiley Online Library, 1990.

[6] Lorenz S Cederbaum. The exact molecular wavefunction as a product of an electronic and a nuclear wavefunction. Journal of Chemical Physics, 138(22):224110, 2013.

[7] E Deumens, A Diz, R Longo, and Y Öhrn. Time-dependent theoretical treatments of the dynamics of electrons and nuclei in molecular systems. Reviews of Modern Physics, 66(3):917, 1994.

[8] Dominik Marx and Jürg Hutter. Ab initio molecular dynamics: Theory and implementation. Modern Methods and Algorithms of Quantum Chemistry, 1(301-449):141, 2000.

[9] Jörg Behler. Neural network potential-energy surfaces in chemistry: a tool for large-scale simulations. Physical Chemistry Chemical Physics, 13(40):17930–17955, 2011.

[10] Murray S Daw and Michael I Baskes. Embedded-atom method: Derivation and application to impurities, surfaces, and other defects in metals. Physical Review B, 29(12):6443, 1984.

[11] J Tersoff. New empirical model for the structural properties of silicon. Physical Review Letters, 56(6):632, 1986.

[12] Donald W Brenner and Barbara J Garrison. Dissociative valence force field potential for silicon. Physical Review B, 34(2):1304, 1986.

[13] Fu-Ming Tao, S Drucker, RC Cohen, and William Klemperer. Ab initio potential energy surface and dynamics of He–CO. Journal of Chemical Physics, 101(10):8680–8686, 1994.

[14] Carl De Boor. A practical guide to splines, volume 27. Springer-Verlag New York, 1978.

[15] Josef Ischtwan and Michael A Collins. Molecular potential energy surfaces by interpolation. Journal of Chemical Physics, 100(11):8080–8088, 1994.

[16] Albert P Bartok, Mike C Payne, Risi Kondor, and Gabor Csanyi. Gaussian approximation potentials: The accuracy of quantum mechanics, without the electrons. Physical Review Letters, 104(13):136403, 2010.

[17] Zhenwei Li, James R Kermode, and Alessandro De Vita. Molecular dynamics with on-the-fly machine learning of quantum-mechanical forces. Physical Review Letters, 114(9):096405, 2015.

[18] Dmitrii E Makarov and Horia Metiu. Fitting potential-energy surfaces: a search in the function space by directed genetic programming. Journal of Chemical Physics, 108(2):590–598, 1998.

[19] Mark Tuckerman, Kari Laasonen, Michiel Sprik, and Michele Parrinello. Ab initio molecular dynamics simulation of the solvation and transport of H3O+ and OH- ions in water. The Journal of Physical Chemistry, 99(16):5749–5752, 1995.

[20] Andrea Zen, Ye Luo, Guglielmo Mazzola, Leonardo Guidoni, and Sandro Sorella. Ab initio molecular dynamics simulation of liquid water by quantum monte carlo. The Journal of chemical physics, 142(14):144111, 2015.

[21] Dudley R Herschbach and Victor W Laurie. Anharmonic potential constants and their dependence upon bond length. Journal of Chemical Physics, 35(2):458–464, 1961.

[22] Steven W Bunte and Huai Sun. Molecular modeling of energetic materials: the parameterization and validation of nitrate esters in the COMPASS force field. Journal of Physical Chemistry B, 104(11):2477–2489, 2000.

[23] Kim Palmo, Berit Mannfors, Noemi G Mirkin, and Samuel Krimm. Potential energy functions: from consistent force fields to spectroscopically determined polarizable force fields. Biopolymers, 68(3):383–394, 2003.

[24] DL Gray and AG Robiette. The anharmonic force field and equilibrium structure of methane. Molecular Physics, 37(6):1901–1920, 1979.

[25] Gary S Grest and Kurt Kremer. Molecular dynamics simulation for polymers in the presence of a heat bath. Physical Review A, 33(5):3628, 1986.

[26] Jay W Ponder, Chuanjie Wu, Pengyu Ren, Vijay S Pande, John D Chodera, Michael J Schnieders, Imran Haque, David L Mobley, Daniel S Lambrecht, Robert A DiStasio Jr, et al. Current status of the AMOEBA polarizable force field. Journal of Physical Chemistry B, 114(8):2549–2564, 2010.

[27] Behnam Vessal. Simulation studies of silicates and phosphates. Journal of Non-Crystalline Solids, 177:103–124, 1994.

[28] Naoki Kumagai, Katsuyuki Kawamura, and Toshio Yokokawa. An interatomic potential model for H2O: applications to water and ice polymorphs. Molecular Simulation, 12(3-6):177–186, 1994.

[29] LA Carreira. Determination of the torsional potential function of 1,3-butadiene. The Journal of Chemical Physics, 62(10):3851–3854, 1975.

[30] U Koch, PLA Popelier, and AJ Stone. Conformational dependence of atomic multipole moments. Chemical physics letters, 238(4):253–260, 1995.

[31] Michael G Darley and Paul LA Popelier. Role of short-range electrostatics in torsional potentials. The Journal of Physical Chemistry A, 112(50):12954–12965, 2008.

[32] Frank Jensen. Introduction to computational chemistry. John Wiley & Sons, 2013.

[33] Benedetta Mennucci. Polarizable continuum model. Wiley Interdisciplinary Reviews: Computational Molecular Science, 2(3):386–404, 2012.

[34] Andreas Klamt and G Schüürmann. COSMO: a new approach to dielectric screening in solvents with explicit expressions for the screening energy and its gradient. Journal of the Chemical Society, Perkin Transactions 2, 1(5):799–805, 1993.

[35] U Chandra Singh and Peter A Kollman. An approach to computing electrostatic charges for molecules. Journal of Computational Chemistry, 5(2):129–145, 1984.

[36] Yong Duan, Chun Wu, Shibasish Chowdhury, Mathew C Lee, Guoming Xiong, Wei Zhang, Rong Yang, Piotr Cieplak, Ray Luo, Taisung Lee, et al. A point-charge force field for molecular mechanics simulations of proteins based on condensed-phase quantum mechanical calculations. Journal of Computational Chemistry, 24(16):1999–2012, 2003.

[37] Wilfried J Mortier, Karin Van Genechten, and Johann Gasteiger. Electronegativity equalization: application and parametrization. Journal of the American Chemical Society, 107(4):829–835, 1985.

[38] SR Cox and DE Williams. Representation of the molecular electrostatic potential by a net atomic charge model. Journal of Computational chemistry, 2(3):304–323, 1981.

[39] Thomas D Rasmussen and Frank Jensen. Force field modelling of conformational energies. Molecular Simulation, 30(11-12):801–806, 2004.

[40] Michelle Miller Francl, Christina Carey, Lisa Emily Chirlian, and David M Gange. Charges fit to electrostatic potentials. II. Can atomic charges be unambiguously fit to electrostatic potentials? Journal of Computational Chemistry, 17(3):367–383, 1996.

[41] Christopher I Bayly, Piotr Cieplak, Wendy Cornell, and Peter A Kollman. A well-behaved electrostatic potential based method using charge restraints for deriving atomic charges: the RESP model. The Journal of Physical Chemistry, 97(40):10269–10280, 1993.

[42] Michelle Miller Francl and Lisa Emily Chirlian. The pluses and minuses of mapping atomic charges to electrostatic potentials. Reviews in computational chemistry, 14:1–31, 2000.

[43] Michael Patra, Mikko Karttunen, Marja T Hyvönen, E Falck, Peter Lindqvist, and Ilpo Vattulainen. Molecular dynamics simulations of lipid bilayers: major artifacts due to truncating electrostatic interactions. Biophysical Journal, 84(6):3636–3645, 2003.

[44] Yoshiteru Yonetani. Liquid water simulation: A critical examination of cutoff length. The Journal of chemical physics, 124(20):204501, 2006.

[45] Daan Frenkel and Berend Smit. Understanding molecular simulation: from algorithms to applications. Computational sciences series, 1:1–638, 2002.

[46] Paul P Ewald. Die Berechnung optischer und elektrostatischer Gitterpotentiale. Annalen der Physik, 369(3):253–287, 1921.

[47] Christopher J Fennell and J Daniel Gezelter. Is the Ewald summation still necessary? Pairwise alternatives to the accepted standard for long-range electrostatics. The Journal of Chemical Physics, 124(23):234104, 2006.

[48] Abdulnour Y Toukmaji and John A Board. Ewald summation techniques in perspective: a survey. Computer Physics Communications, 95(2):73–92, 1996.

[49] Celeste Sagui and Thomas A Darden. Molecular dynamics simulations of biomolecules: long-range electrostatic effects. Annual review of biophysics and biomolecular structure, 28(1):155–179, 1999.

[50] John Edward Lennard-Jones. Cohesion. Proceedings of the Physical Society, 43(5):461, 1931.

[51] Ali Khalaf Al-Matar and David A Rockstraw. A generating equation for mixing rules and two new mixing rules for interatomic potential energy parameters. Journal of Computational Chemistry, 25(5):660–668, 2004.

[52] Fritz London. The general theory of molecular forces. Trans. Faraday Soc., 33:8b–26, 1937.

[53] JHR Clarke, W Smith, and LV Woodcock. Short range effective potentials for ionic fluids. The Journal of chemical physics, 84(4):2290–2294, 1986.

[54] John D Weeks, David Chandler, and Hans C Andersen. Role of repulsive forces in determining the equilibrium structure of simple liquids. The Journal of chemical physics, 54(12):5237–5247, 1971.

[55] Christopher J Cramer. Essentials of computational chemistry: theories and models. John Wiley & Sons, 2013.

[56] John C Tully. Nonadiabatic molecular dynamics. International Journal of Quantum Chemistry, 40(S25):299–309, 1991.

[57] Dominik Marx and Jürg Hutter. Ab initio molecular dynamics: basic theory and advanced methods. Cambridge University Press, 2009.

[58] Oleg V Prezhdo. Mean field approximation for the stochastic Schrödinger equation. Journal of Chemical Physics, 111(18):8366–8377, 1999.

[59] Srinivasan S Iyengar, H Bernhard Schlegel, John M Millam, Gregory A Voth, Gustavo E Scuseria, and Michael J Frisch. Ab initio molecular dynamics: Propagating the density matrix with gaussian orbitals. ii. generalizations based on mass-weighting, idempotency, energy conservation and choice of initial conditions. Journal of Chemical Physics, 115(22):10291–10302, 2001.

[60] H Bernhard Schlegel, John M Millam, Srinivasan S Iyengar, Gregory A Voth, Andrew D Daniels, Gustavo E Scuseria, and Michael J Frisch. Ab initio molecular dynamics: Propagating the density matrix with gaussian orbitals. Journal of Chemical Physics, 114(22):9758–9763, 2001.

[61] Claude Leforestier. Classical trajectories using the full ab initio potential energy surface H⁻ + CH4 → CH4 + H⁻. Journal of Chemical Physics, 68(10):4406–4410, 1978.

[62] Dominik Marx and Michele Parrinello. Ab initio path integral molecular dynamics: Basic ideas. The Journal of chemical physics, 104(11):4077–4082, 1996.

[63] Roberto Car and Michele Parrinello. Unified approach for molecular dynamics and density-functional theory. Physical Review Letters, 55(22):2471, 1985.

[64] Fred L Hirshfeld. Bonded-atom fragments for describing molecular charge densities. Theoretica Chimica Acta, 44(2):129–138, 1977.

[65] Richard FW Bader and Chérif F Matta. Atomic charges are measurable quantum expectation values: a rebuttal of criticisms of QTAIM charges. Journal of Physical Chemistry A, 108(40):8385–8394, 2004.

[66] Celia Fonseca Guerra, Jan-Willem Handgraaf, Evert Jan Baerends, and F Matthias Bickelhaupt. Voronoi deformation density (VDD) charges: Assessment of the Mulliken, Bader, Hirshfeld, Weinhold, and VDD methods for charge analysis. Journal of Computational Chemistry, 25(2):189–210, 2004.

[67] Roman F Nalewajski and Robert G Parr. Information theory thermodynamics of molecules and their Hirshfeld fragments. Journal of Physical Chemistry A, 105(31):7391–7400, 2001.

[68] Paul W Ayers. Information theory, the shape function, and the hirshfeld atom. Theoretical Chemistry Accounts, 115(5):370–378, 2006.

[69] Weitao Yang and Robert G Parr. Hardness, softness, and the fukui function in the electronic theory of metals and catalysis. Proceedings of the National Academy of Sciences, 82(20):6723–6726, 1985.

[70] Frank De Proft, Christian Van Alsenoy, Anik Peeters, Wilfried Langenaeker, and Paul Geerlings. Atomic charges, dipole moments, and Fukui functions using the Hirshfeld partitioning of the electron density. Journal of Computational Chemistry, 23(12):1198–1209, 2002.

[71] A Krishtal, Patrick Senet, M Yang, and C Van Alsenoy. A Hirshfeld partitioning of polarizabilities of water clusters. Journal of Chemical Physics, 125(3):034312, 2006.

[72] Mark A Spackman and Patrick G Byrom. A novel definition of a molecule in a crystal. Chemical Physics Letters, 267(3):215–220, 1997.

[73] Axel Becke, Cherif F Matta, and Russell J Boyd. The quantum theory of atoms in molecules: from solid state to DNA and drug design. John Wiley & Sons, 2007.

[74] PLA Popelier, FM Aicken, and SE O’Brien. Atoms in molecules. Chemical Modelling: Applications and Theory, 1:143–198, 2000.

[75] Nathaniel OJ Malcolm and Paul LA Popelier. An algorithm to delineate and integrate topological basins in a three-dimensional quantum mechanical density function. Journal of computational chemistry, 24(10):1276–1282, 2003.

[76] PLA Popelier. Integration of atoms in molecules: a critical examination. Molecular Physics, 87(5):1169–1187, 1996.

[77] Paul Lode Albert Popelier, L Joubert, and DS Kosov. Convergence of the electrostatic interaction based on topological atoms. The Journal of Physical Chemistry A, 105(35):8254–8261, 2001.

[78] Eduardo FF Rodrigues, Eduardo L de Sá, and Roberto LA Haiduke. Electrostatic properties of small molecules by means of atomic multipoles from the quantum theory of atoms in molecules. International Journal of Quantum Chemistry, 108(13):2417–2427, 2008.

[79] Thiago CF Gomes, João Viçozo da Silva Jr, Luciano N Vidal, Pedro AM Vazquez, and Roy E Bruns. CHELPG and QTAIM atomic charge and dipole models for the infrared fundamental intensities of the fluorochloromethanes. Theoretical Chemistry Accounts, 121(3-4):173–179, 2008.

[80] Curt M Breneman and Kenneth B Wiberg. Determining atom-centered monopoles from molecular electrostatic potentials. The need for high sampling density in formamide conformational analysis. Journal of Computational Chemistry, 11(3):361–373, 1990.

[81] Leon Cohen. Local kinetic energy in quantum mechanics. The Journal of Chemical Physics, 70(2):788–789, 1979.

[82] Richard FW Bader and PM Beddall. Virial field relationship for molecular charge distributions and the spatial partitioning of molecular properties. Journal of Chemical Physics, 56(7):3320–3329, 1972.

[83] James SM Anderson, Paul W Ayers, and Juan I Rodriguez Hernandez. How ambiguous is the local kinetic energy? The Journal of Physical Chemistry A, 114(33):8884–8895, 2010.

[84] Paul LA Popelier. Quantum Chemical Topology. Springer, 2016.

[85] Keiji Morokuma. Molecular orbital studies of hydrogen bonds. III. C=O···H–O hydrogen bond in H2CO···H2O and H2CO···2H2O. The Journal of Chemical Physics, 55(3):1236–1244, 1971.

[86] Tom Ziegler and Arvi Rauk. On the calculation of bonding energies by the Hartree Fock Slater method. Theoretica Chimica Acta, 46(1):1–10, 1977.

[87] Gernot Frenking and Moritz Hopffgarten. Calculation of bonding properties. Encyclopedia of Inorganic and Bioinorganic Chemistry, 2009.

[88] Per-Olov Löwdin. Quantum theory of many-particle systems. I. Physical interpretations by means of density matrices, natural spin-orbitals, and convergence problems in the method of configurational interaction. Physical Review, 97(6):1474, 1955.

[89] R McWeeny. Some recent advances in density matrix theory. Reviews of Modern Physics, 32(2):335, 1960.

[90] MA Blanco, A Martin Pendas, and E Francisco. Interacting quantum atoms: a correlated energy decomposition scheme based on the quantum theory of atoms in molecules. Journal of Chemical Theory and Computation, 1(6):1096–1109, 2005.

[91] Lev Davidovich Landau and Evgenii Mikhailovich Lifshitz. The classical theory of fields. Pergamon, 1971.

[92] Anthony Stone. The theory of intermolecular forces. OUP Oxford, 2013.

[93] Richard N Zare. Angular momentum: understanding spatial aspects in chemistry and physics. Wiley-Interscience, 2013.

[94] Christof Hättig. Recurrence relations for the direct calculation of spherical multipole interaction tensors and Coulomb-type interaction energies. Chemical Physics Letters, 260(3):341–351, 1996.

[95] James F Harrison. On the role of the electron density difference in the interpretation of molecular properties. The Journal of chemical physics, 119(16):8763–8764, 2003.

[96] M Rafat and PLA Popelier. A convergent multipole expansion for 1,3 and 1,4 Coulomb interactions. The Journal of Chemical Physics, 124(14):144102, 2006.

[97] PLA Popelier and M Rafat. The electrostatic potential generated by topological atoms: a continuous multipole method leading to larger convergence regions. Chemical Physics Letters, 376(1):148–153, 2003.

[98] CJF Solano, A Martin Pendas, E Francisco, MA Blanco, and PLA Popelier. Convergence of the multipole expansion for 1,2 Coulomb interactions: The modified multipole shifting algorithm. The Journal of Chemical Physics, 132(19):194110, 2010.

[99] Arthur A Frost. Floating spherical Gaussian orbital model of molecular structure. I. Computational procedure. LiH as an example. The Journal of Chemical Physics, 47(10):3707–3713, 1967.

[100] S Francis Boys. Electronic wave functions. I. A general method of calculation for the stationary states of any molecular system. Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, 200(1063):542–554, 1950.

[101] AJ Stone. Distributed multipole analysis, or how to describe a molecular charge distribution. Chemical Physics Letters, 83(2):233–239, 1981.

[102] AJ Stone and M Alderton. Distributed multipole analysis: methods and applications. Molecular Physics, 56(5):1047–1064, 1985.

[103] Fabienne Vigné-Maeder and Pierre Claverie. The exact multicenter multipolar part of a molecular charge distribution and its simplified representations. The Journal of Chemical Physics, 88(8):4934–4948, 1988.

[104] Anthony J Stone. Distributed multipole analysis: Stability for large basis sets. Journal of Chemical Theory and Computation, 1(6):1128–1132, 2005.

[105] Joseph D Yesselman, Daniel J Price, Jennifer L Knight, and Charles L Brooks. Match: An atom-typing toolset for molecular mechanics force fields. Journal of Computational Chemistry, 33(2):189–202, 2012.

[106] RFW Bader, A Larouche, C Gatti, MT Carroll, PJ MacDougall, and KB Wiberg. Properties of atoms in molecules: dipole moments and transferability of properties. Journal of Chemical Physics, 87(2):1142–1152, 1987.

[107] RFW Bader, TA Keith, KM Gough, and KE Laidig. Properties of atoms in molecules: additivity and transferability of group polarizabilities. Molecular Physics, 75(5):1167–1189, 1992.

[108] SL Price, CH Faerman, and CW Murray. Toward accurate transferable electrostatic models for polypeptides: A distributed multipole study of blocked amino acid residue charge distributions. Journal of Computational Chemistry, 12(10):1187–1197, 1991.

[109] Carlos H Faerman and Sarah L Price. A transferable distributed multipole model for the electrostatic interactions of peptides and amides. Journal of the American Chemical Society, 112(12):4915–4926, 1990.

[110] Wijnand TM Mooij, Frans B van Duijneveldt, Jeanne GCM van Duijneveldt-van de Rijdt, and Bouke P van Eijck. Transferable ab initio intermolecular potentials. 1. Derivation from methanol dimer and trimer calculations. Journal of Physical Chemistry A, 103(48):9872–9882, 1999.

[111] Revati Kumar, Fang-Fang Wang, Glen R Jenness, and Kenneth D Jordan. A second generation distributed point polarizable water model. Journal of Chemical Physics, 132(1):014309, 2010.

[112] DS Kosov, Paul Lode Albert Popelier, et al. Convergence of the multipole expansion for electrostatic potentials of finite topological atoms. Journal of Chemical Physics, 113(10):3969–3974, 2000.

[113] Keith E Laidig. General expression for the spatial partitioning of the moments and multipole moments of molecular charge distributions. Journal of Physical Chemistry, 97(49):12760–12767, 1993.

[114] Christopher E Whitehead, Curt M Breneman, Nagamani Sukumar, and M Dominic Ryan. Transferable atom equivalent multicentered multipole expansion method. Journal of Computational Chemistry, 24(4):512–529, 2003.

[115] Curt M Breneman, Tracy R Thompson, Marlon Rhem, and Mei Dung. Electron density modeling of large systems using the transferable atom equivalent method. Computers & Chemistry, 19(3):161–179, 1995.

[116] Paul LA Popelier and Fiona M Aicken. Atomic properties of amino acids: Computed atom types as a guide for future force-field design. ChemPhysChem, 4(8):824–829, 2003.

[117] PLA Popelier, M Devereux, and M Rafat. The quantum topological electrostatic potential as a probe for functional group transferability. Acta Crystallographica Section A: Foundations of Crystallography, 60(5):427–433, 2004.

[118] Yongna Yuan, Matthew JL Mills, and Paul LA Popelier. Multipolar electrostatics for proteins: Atom–atom electrostatic energies in crambin. Journal of Computational Chemistry, 35(5):343–359, 2014.

[119] Tibor Koritsanszky, Anatoliy Volkov, and Philip Coppens. Aspherical-atom scattering factors from molecular wave functions. 1. Transferability and conformation dependence of atomic electron densities of peptides within the multipole formalism. Acta Crystallographica Section A: Foundations of Crystallography, 58(5):464–472, 2002.

[120] Matthew JL Mills and Paul LA Popelier. Polarisable multipolar electrostatics from the machine learning method kriging: an application to alanine. Theoretical Chemistry Accounts, 131(3):1–16, 2012.

[121] Shaun M Kandathil, Timothy L Fletcher, Yongna Yuan, Joshua Knowles, and Paul LA Popelier. Accuracy and tractability of a kriging model of intramolecular polarizable multipolar electrostatics and its application to histidine. Journal of Computational Chemistry, 34(21):1850–1861, 2013.

[122] C Jelsch, V Pichon-Pesme, C Lecomte, and A Aubry. Transferability of multipole charge-density parameters: application to very high resolution oligopeptide and protein structures. Acta Crystallographica Section D: Biological Crystallography, 54(6):1306–1318, 1998.

[123] Niels K Hansen and Philip Coppens. Testing aspherical atom refinements on small-molecule data sets. Acta Crystallographica Section A: Crystal Physics, Diffraction, Theoretical and General Crystallography, 34(6):909–921, 1978.

[124] Nicolas Muzet, Benoît Guillot, Christian Jelsch, Eduardo Howard, and Claude Lecomte. Electrostatic complementarity in an aldose reductase complex from ultra-high-resolution crystallography and first-principles calculations. Proceedings of the National Academy of Sciences, 100(15):8742–8747, 2003.

[125] Anatoliy Volkov, Tibor Koritsanszky, and Philip Coppens. Combination of the exact potential and multipole methods (EP/MM) for evaluation of intermolecular electrostatic interaction energies with pseudoatom representation of molecular electron densities. Chemical Physics Letters, 391(1):170–175, 2004.

[126] Frank H Allen. The Cambridge Structural Database: a quarter of a million crystal structures and rising. Acta Crystallographica Section B: Structural Science, 58(3):380–388, 2002.

[127] Anatoliy Volkov, Xue Li, Tibor Koritsanszky, and Philip Coppens. Ab initio quality electrostatic atomic and molecular properties including intermolecular energies from a transferable theoretical pseudoatom databank. Journal of Physical Chemistry A, 108(19):4283–4300, 2004.

[128] Magdalena Woinska and Paulina M Dominiak. Transferability of atomic multipoles in amino acids and peptides for various density partitions. Journal of Physical Chemistry A, 117(7):1535–1547, 2011.

[129] Paulina M Dominiak, Anatoliy Volkov, Xue Li, Marc Messerschmidt, and Philip Coppens. A theoretical databank of transferable aspherical atoms and its application to electrostatic interaction energy calculations of macromolecules. Journal of Chemical Theory and Computation, 3(1):232–247, 2007.

[130] V Pichon-Pesme, C Lecomte, and H Lachekar. On building a data bank of transferable experimental electron density parameters applicable to polypeptides. Journal of Physical Chemistry, 99(16):6242–6250, 1995.

[131] Katarzyna N Jarzembska and Paulina M Dominiak. New version of the theoretical databank of transferable aspherical pseudoatoms, UBDB2011–towards nucleic acid modelling. Acta Crystallographica Section A: Foundations of Crystallography, 68(1):139–147, 2012.

[132] Paulina M Dominiak, Enrique Espinosa, and János G Ángyán. Intermolecular interaction energies from experimental charge density studies. In Modern Charge-Density Analysis, pages 387–433. Springer, 2011.

[133] KY Sanbonmatsu and C-S Tung. High performance computing in biology: multimillion atom simulations of nanoscale systems. Journal of Structural Biology, 157(3):470–480, 2007.

[134] Laxmikant Kale, Robert Skeel, Milind Bhandarkar, Robert Brunner, Attila Gursoy, Neal Krawetz, James Phillips, Aritomo Shinozaki, Krishnan Varadarajan, and Klaus Schulten. NAMD2: greater scalability for parallel molecular dynamics. Journal of Computational Physics, 151(1):283–312, 1999.

[135] G Andrés Cisneros, Mikko Karttunen, Pengyu Ren, and Celeste Sagui. Classical electrostatics for biomolecular simulations. Chemical Reviews, 114(1):779–814, 2013.

[136] Sarah L Price and Anthony J Stone. Electrostatic models for polypeptides: can we assume transferability? Journal of the Chemical Society, Faraday Transactions, 88(13):1755–1763, 1992.

[137] Pawel Kedzierski and W Andrzej Sokalski. Analysis of the transferability of atomic multipoles for amino acids in modeling macromolecular charge distribution from fragments. Journal of Computational Chemistry, 22(10):1082–1097, 2001.

[138] Nuria Plattner and Markus Meuwly. The role of higher CO-multipole moments in understanding the dynamics of photodissociated carbonmonoxide in myoglobin. Biophysical Journal, 94(7):2505–2515, 2008.

[139] Nuria Plattner and Markus Meuwly. Higher order multipole moments for molecular dynamics simulations. Journal of Molecular Modeling, 15(6):687–694, 2009.

[140] Panagiotis G Karamertzanis and Sarah L Price. Energy minimization of crystal structures containing flexible molecules. Journal of Chemical Theory and Computation, 2(4):1184–1199, 2006.

[141] Christof Hättig. On the calculation of derivatives for Coulomb-type interaction energies and general anisotropic pair potentials. Chemical Physics Letters, 268(5):521–530, 1997.

[142] PLA Popelier and AJ Stone. Formulae for the first and second derivatives of anisotropic potentials with respect to geometrical parameters. Molecular Physics, 82(2):411–425, 1994.

[143] Paul LA Popelier, Anthony J Stone, and David J Wales. Topography of potential-energy surfaces for van der Waals complexes. Faraday Discussions, 97:243–264, 1994.

[144] Maurice Leslie. DL_MULTI–a molecular dynamics program to use distributed multipole electrostatic models to simulate the dynamics of organic crystals. Molecular Physics, 106(12-13):1567–1578, 2008.

[145] Dennis M Elking, Lalith Perera, Robert Duke, Thomas Darden, and Lee G Pedersen. Atomic forces for geometry-dependent point multipole and Gaussian multipole models. Journal of Computational Chemistry, 31(15):2702–2713, 2010.

[146] Matthew JL Mills and Paul LA Popelier. Electrostatic forces: Formulas for the first derivatives of a polarizable, anisotropic electrostatic potential energy function based on machine learning. Journal of Chemical Theory and Computation, 10(9):3840–3856, 2014.

[147] Celeste Sagui, Lee G Pedersen, and Thomas A Darden. Towards an accurate representation of electrostatics in classical force fields: Efficient implementation of multipolar interactions in biomolecular simulations. The Journal of Chemical Physics, 120(1):73–87, 2004.

[148] Alan Grossfield, Pengyu Ren, and Jay W Ponder. Ion solvation thermodynamics from simulation with a polarizable force field. Journal of the American Chemical Society, 125(50):15671–15682, 2003.

[149] Johnny C Wu, Jean-Philip Piquemal, Robin Chaudret, Peter Reinhardt, and Pengyu Ren. Polarizable molecular dynamics simulation of Zn(II) in water using the AMOEBA force field. Journal of Chemical Theory and Computation, 6(7):2059–2070, 2010.

[150] Jean-Philip Piquemal, Lalith Perera, G Andrés Cisneros, Pengyu Ren, Lee G Pedersen, and Thomas A Darden. Towards accurate solvation dynamics of divalent cations in water using the polarizable AMOEBA force field: From energetics to structure. The Journal of Chemical Physics, 125(5):054511, 2006.

[151] Jay W Ponder and David A Case. Force fields for protein simulations. Advances in Protein Chemistry, 66:27–86, 2003.

[152] Pengyu Ren and Jay W Ponder. Consistent treatment of inter- and intramolecular polarization in molecular mechanics calculations. Journal of Computational Chemistry, 23(16):1497–1506, 2002.

[153] T Liang and TR Walsh. Molecular dynamics simulations of peptide carboxylate hydration. Physical Chemistry Chemical Physics, 8(38):4410–4419, 2006.

[154] T Liang and TR Walsh. Simulation of the hydration structure of glycyl-alanine. Molecular Simulation, 33(4-5):337–342, 2007.

[155] Jakub Kaminsky and Frank Jensen. Force field modeling of amino acid conformational energies. Journal of Chemical Theory and Computation, 3(5):1774–1788, 2007.

[156] Thomas D Rasmussen, Pengyu Ren, Jay W Ponder, and Frank Jensen. Force field modeling of conformational energies: importance of multipole moments and intramolecular polarization. International Journal of Quantum Chemistry, 107(6):1390–1395, 2007.

[157] Pengyu Ren and Jay W Ponder. Polarizable atomic multipole water model for molecular mechanics simulation. The Journal of Physical Chemistry B, 107(24):5933–5947, 2003.

[158] Pengyu Ren and Jay W Ponder. Temperature and pressure dependence of the AMOEBA water model. The Journal of Physical Chemistry B, 108(35):13427–13437, 2004.

[159] Yue Shi, Chuanjie Wu, Jay W Ponder, and Pengyu Ren. Multipole electrostatics in hydration free energy calculations. Journal of Computational Chemistry, 32(5):967–977, 2011.

[160] Alex Brown, Bastiaan J Braams, Kurt Christoffel, Zhong Jin, and Joel M Bowman. Classical and quasiclassical spectral analysis of CH5+ using an ab initio potential energy surface. The Journal of Chemical Physics, 119(17):8790–8793, 2003.

[161] AV Kazantsev, PG Karamertzanis, CS Adjiman, and CC Pantelides. Efficient handling of molecular flexibility in lattice energy minimization of organic crystals. Journal of Chemical Theory and Computation, 7(6):1998–2016, 2011.

[162] John Kuszewski, Angela M Gronenborn, and G Marius Clore. Improving the quality of NMR and crystallographic protein structures by means of a conformational database potential derived from structure databases. Protein Science, 5(6):1067–1080, 1996.

[163] Sandeep Patel and Charles L Brooks. CHARMM fluctuating charge force field for proteins: I. Parameterization and application to bulk organic liquid simulations. Journal of Computational Chemistry, 25(1):1–16, 2004.

[164] Harry A Stern, F Rittner, Bruce J Berne, and Richard A Friesner. Combined fluctuating charge and polarizable dipole models: Application to a five-site water potential function. The Journal of Chemical Physics, 115(5):2237–2251, 2001.

[165] Michael Devereux, Nuria Plattner, and Markus Meuwly. Application of multipolar charge models and molecular dynamics simulations to study Stark shifts in inhomogeneous electric fields. The Journal of Physical Chemistry A, 113(47):13199–13209, 2009.

[166] Christian Kramer, Peter Gedeck, and Markus Meuwly. Atomic multipoles: Electrostatic potential fit, local reference axis systems, and conformational dependence. Journal of Computational Chemistry, 33(20):1673–1688, 2012.

[167] Raghunathan Ramakrishnan and O Anatole von Lilienfeld. Machine learning, quantum mechanics, and chemical compound space. arXiv:1510.07512, 2015.

[168] Teuvo Kohonen. An introduction to neural computing. Neural Networks, 1(1):3–16, 1988.

[169] Teuvo Kohonen. Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43(1):59–69, 1982.

[170] John J Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79(8):2554–2558, 1982.

[171] Jörg Behler and Michele Parrinello. Generalized neural-network representation of high-dimensional potential-energy surfaces. Physical Review Letters, 98(14):146401, 2007.

[172] E Sanville, A Bholoa, Roger Smith, and Steven D Kenny. Silicon potentials investigated using density functional theory fitted neural networks. Journal of Physics: Condensed Matter, 20(28):285219, 2008.

[173] Kwang-Hwi Cho, Kyoung Tai No, and Harold A Scheraga. A polarizable force field for water using an artificial neural network. Journal of Molecular Structure, 641(1):77–91, 2002.

[174] Steven Hobday, Roger Smith, and Joe Belbruno. Applications of neural networks to fitting interatomic potential functions. Modelling and Simulation in Materials Science and Engineering, 7(3):397, 1999.

[175] Chris M Handley, Glenn I Hawe, Douglas B Kell, and Paul LA Popelier. Optimal construction of a fast and accurate polarisable water potential based on multipole moments trained by machine learning. Physical Chemistry Chemical Physics, 11(30):6365–6376, 2009.

[176] S Houlding, SY Liem, and PLA Popelier. A polarizable high-rank quantum topological electrostatic potential developed using neural networks: Molecular dynamics simulations on the hydrogen fluoride dimer. International Journal of Quantum Chemistry, 107(14):2817–2827, 2007.

[177] Chris M Handley and Paul LA Popelier. Potential energy surfaces fitted by artificial neural networks. The Journal of Physical Chemistry A, 114(10):3371–3383, 2010.

[178] Igor V Tetko, David J Livingstone, and Alexander I Luik. Neural network studies. 1. Comparison of overfitting and overtraining. Journal of Chemical Information and Computer Sciences, 35(5):826–833, 1995.

[179] Christopher KI Williams and Carl Edward Rasmussen. Gaussian Processes for Regression. MIT Press, 1996.

[180] David JC MacKay. Information theory, inference and learning algorithms. Cambridge University Press, 2003.

[181] Albert P Bartók and Gábor Csányi. Gaussian approximation potentials: A brief tutorial introduction. International Journal of Quantum Chemistry, 115(16):1051–1057, 2015.

[182] Jörg Behler. Atom-centered symmetry functions for constructing high-dimensional neural network potentials. The Journal of Chemical Physics, 134(7):074106, 2011.

[183] Albert P Bartók, Risi Kondor, and Gábor Csányi. On representing chemical environments. Physical Review B, 87(18):184115, 2013.

[184] Bing Huang and O Anatole von Lilienfeld. Systematic improvement of molecular representations for machine learning models. arXiv:1608.06194, 2016.

[185] DG Krige. Two-dimensional weighted moving average trend surfaces for ore-evaluation. Journal of the South African Institute of Mining and Metallurgy, 66:13–38, 1966.

[186] Joaquim Barris and Pilar Garcia Almirall. A density function of the appraisal value. Calculation and evaluation of the empirical density function of the appraisal value based on comparison method, spatial correlation techniques, resampling methods, compliant with the Spanish legal framework. Technical report, European Real Estate Society (ERES), 2011.

[187] Oghenekarho Okobiah, Saraju P Mohanty, and Elias Kougianos. Geostatistical-inspired fast layout optimisation of a nano-CMOS thermal sensor. IET Circuits, Devices & Systems, 7(5):253–262, 2013.

[188] Hanefi Bayraktar and F Sezer Turalioglu. A kriging-based approach for locating a sampling site–in the assessment of air quality. Stochastic Environmental Research and Risk Assessment, 19(4):301–305, 2005.

[189] Jerome Sacks, William J Welch, Toby J Mitchell, and Henry P Wynn. Design and analysis of computer experiments. Statistical Science, 4(4):409–423, 1989.

[190] Donald R Jones, Matthias Schonlau, and William J Welch. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13(4):455–492, 1998.

[191] Donald R Jones. A taxonomy of global optimization methods based on response surfaces. Journal of Global Optimization, 21(4):345–383, 2001.

[192] Steffen L Lauritzen. Time series analysis in 1880: A discussion of contributions made by TN Thiele. International Statistical Review/Revue Internationale de Statistique, pages 319–331, 1981.

[193] Hao Zhang and Yong Wang. Kriging and cross-validation for massive spatial data. Environmetrics, 21(3-4):290–304, 2010.

[194] Timothy W Simpson, Timothy M Mauery, John J Korte, and Farrokh Mistree. Kriging models for global approximation in simulation-based multidisciplinary design optimization. AIAA Journal, 39(12):2233–2241, 2001.

[195] Gene H Golub and Charles F Van Loan. Matrix Computations, volume 3. JHU Press, 2012.

[196] George J Davis and Max D Morris. Six factors which affect the condition number of matrices associated with kriging. Mathematical Geology, 29(5):669–683, 1997.

[197] Yuhui Shi and Russell Eberhart. A modified particle swarm optimizer. In Evolutionary Computation Proceedings, 1998. IEEE World Congress on Computational Intelligence, The 1998 IEEE International Conference on, pages 69–73. IEEE, 1998.

[198] Yuhui Shi and Russell C Eberhart. Empirical study of particle swarm optimization. In Evolutionary Computation, 1999. CEC 99. Proceedings of the 1999 Congress on, volume 3. IEEE, 1999.

[199] Zhi-Hui Zhan, Jun Zhang, Yun Li, and Henry Shu-Hung Chung. Adaptive particle swarm optimization. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 39(6):1362–1381, 2009.

[200] Maurice Clerc. The swarm and the queen: towards a deterministic and adaptive particle swarm optimization. In Evolutionary Computation, 1999. CEC 99. Proceedings of the 1999 Congress on, volume 3. IEEE, 1999.

[201] Morten Lovbjerg, Thomas Kiel Rasmussen, and Thiemo Krink. Hybrid particle swarm optimiser with breeding and subpopulations. In Proceedings of the Genetic and Evolutionary Computation Conference, volume 2001, pages 469–476. Citeseer, 2001.

[202] Rainer Storn and Kenneth Price. Differential evolution–a simple and efficient heuristic for global optimization over continuous spaces. Journal of Global Optimization, 11(4):341–359, 1997.

[203] Nicodemo Di Pasquale, Stuart J Davie, and Paul LA Popelier. Optimization algorithms in optimal predictions of atomistic properties by kriging. Journal of Chemical Theory and Computation, 12(4):1499–1513, 2016.

[204] SY Liem and PLA Popelier. High-rank quantum topological electrostatic potential: Molecular dynamics simulation of liquid hydrogen fluoride. The Journal of Chemical Physics, 119(8):4560–4566, 2003.

[205] SY Liem, PLA Popelier, and M Leslie. Simulation of liquid water using a high-rank quantum topological electrostatic potential. International Journal of Quantum Chemistry, 99(5):685–694, 2004.

[206] Michael G Darley, Chris M Handley, and Paul LA Popelier. Beyond point charges: dynamic polarization from neural net predicted multipole moments. Journal of Chemical Theory and Computation, 4(9):1435–1448, 2008.

[207] Glenn I Hawe and Paul LA Popelier. A water potential based on multipole moments trained by machine learning-reducing maximum energy errors. Canadian Journal of Chemistry, 88(11):1104–1111, 2010.

[208] Matthew JL Mills and Paul LA Popelier. Intramolecular polarisable multipolar electrostatics from the machine learning method kriging. Computational and Theoretical Chemistry, 975(1):42–51, 2011.

[209] Matthew JL Mills, Glenn I Hawe, Christopher M Handley, and Paul LA Popelier. Unified approach to multipolar polarisation and charge transfer for ions: microhydrated Na+. Physical Chemistry Chemical Physics, 15(41):18249–18261, 2013.

[210] Steven Y Liem, Majeed S Shaik, and Paul LA Popelier. Aqueous imidazole solutions: a structural perspective from simulations with high-rank electrostatic multipole moments. The Journal of Physical Chemistry B, 115(39):11389–11398, 2011.

[211] Majeed S Shaik, Steven Y Liem, and Paul LA Popelier. Properties of liquid water from a systematic refinement of a high-rank multipolar electrostatic potential. The Journal of Chemical Physics, 132(17):174504, 2010.

[212] Timothy L Fletcher, Shaun M Kandathil, and Paul LA Popelier. The prediction of atomic kinetic energies from coordinates of surrounding atoms using kriging machine learning. Theoretical Chemistry Accounts, 133(7):1–10, 2014.

[213] Timothy J Hughes, Shaun M Kandathil, and Paul LA Popelier. Accurate prediction of polarised high order electrostatic interactions for hydrogen bonded complexes using the machine learning method kriging. Spectrochimica Acta Part A: Molecular and Biomolecular Spectroscopy, 136:32–41, 2015.

[214] Peter Maxwell, Nicodemo di Pasquale, Salvatore Cardamone, and Paul LA Popelier. The prediction of topologically partitioned intra-atomic and inter-atomic energies by the machine learning method kriging. Theoretical Chemistry Accounts, 135(8):195, 2016.

[215] Peter Maxwell and Paul LA Popelier. Transferable atoms: an intra-atomic perspective through the study of homogeneous oligopeptides. Molecular Physics, 114(7-8):1304–1316, 2016.

[216] Stuart J Davie, Nicodemo Di Pasquale, and Paul LA Popelier. Incorporation of local structure into kriging models for the prediction of atomistic properties in the water decamer. Journal of Computational Chemistry, 2016.

[217] Michel Rafat, Majeed Shaik, and Paul LA Popelier. Transferability of quantum topological atoms in terms of electrostatic interaction energy. The Journal of Physical Chemistry A, 110(50):13578–13583, 2006.

[218] Paul LA Popelier, Enrico Clementi, Jean-Marie Andre, and J Andrew McCammon. Quantum chemical topology: knowledgeable atoms in peptides. AIP Conference Proceedings-American Institute of Physics, 1456(1):261, 2012.

[219] Yongna Yuan, Matthew JL Mills, and Paul LA Popelier. Multipolar electrostatics for proteins: Atom–atom electrostatic energies in crambin. Journal of Computational Chemistry, 35(5):343–359, 2014.

[220] David NJ White and David H Kitson. Computational conformational analysis of cyclohexaglycyl. Journal of Molecular Graphics, 4(2):112–116, 1986.

[221] Mark Lipton and W Clark Still. The multiple minimum problem in molecular modeling. Tree searching internal coordinate conformational space. Journal of Computational Chemistry, 9(4):343–355, 1988.

[222] Jonathan M Goodman and W Clark Still. An unbounded systematic search of conformational space. Journal of Computational Chemistry, 12(9):1110–1117, 1991.

[223] Noham Weinberg and Saul Wolfe. A comprehensive approach to the conformational analysis of cyclic compounds. Journal of the American Chemical Society, 116(22):9860–9868, 1994.

[224] Gordon M Crippen. Exploring the conformation space of cycloalkanes by linearized embedding. Journal of Computational Chemistry, 13(3):351–361, 1992.

[225] Catherine E Peishoff and J Scott Dixon. Improvements to the distance geometry algorithm for conformational sampling of cyclic structures. Journal of Computational Chemistry, 13(5):565–569, 1992.

[226] Martin Saunders. Searching for conformers of nine- to twelve-ring hydrocarbons on the MM2 and MM3 energy surfaces: Stochastic search for interconversion pathways. Journal of Computational Chemistry, 12(6):645–663, 1991.

[227] David M Ferguson and Douglas J Raber. A new approach to probing conformational space with molecular mechanics: random incremental pulse search. Journal of the American Chemical Society, 111(12):4371–4378, 1989.

[228] Berthold von Freyberg and Werner Braun. Efficient search for all low energy conformations of polypeptides by Monte Carlo methods. Journal of Computational Chemistry, 12(9):1065–1076, 1991.

[229] Yaxiong Sun and Peter A Kollman. Conformational sampling and ensemble generation by molecular dynamics simulations: 18-crown-6 as a test case. Journal of Computational Chemistry, 13(1):33–40, 1992.

[230] D Byrne, Jin Li, E Platt, Barry Robson, and Paul Weiner. Novel algorithms for searching conformational space. Journal of Computer-Aided Molecular Design, 8(1):67–82, 1994.

[231] Andrea Amadei, Antonius Linssen, and Herman JC Berendsen. Essential dynamics of proteins. Proteins: Structure, Function, and Bioinformatics, 17(4):412–425, 1993.

[232] Bernard Brooks and Martin Karplus. Harmonic dynamics of proteins: Normal modes and fluctuations in bovine pancreatic trypsin inhibitor. Proceedings of the National Academy of Sciences, 80(21):6571–6575, 1983.

[233] Dušanka Janežič, Richard M Venable, and Bernard R Brooks. Harmonic analysis of large systems. III. Comparison with molecular dynamics. Journal of Computational Chemistry, 16(12):1554–1566, 1995.

[234] Sanna Jaaskelainen, Chandra S Verma, Roderick E Hubbard, Pekka Lestko, and Leo SD Caves. Conformational change in the activation of lipase: An analysis in terms of low-frequency normal modes. Protein Science, 7(6):1359–1367, 1998.

[235] Liliane Mouawad and David Perahia. Motions in hemoglobin studied by normal mode analysis and energy minimization: evidence for the existence of tertiary T-like, quaternary R-like intermediate structures. Journal of Molecular Biology, 258(2):393–410, 1996.

[236] Osni Marques and Yves-Henri Sanejouand. Hinge-bending motion in citrate synthase arising from normal mode calculations. Proteins: Structure, Function, and Bioinformatics, 23(4):557–560, 1995.

[237] Manel A Balsera, Willy Wriggers, Yoshitsugu Oono, and Klaus Schulten. Principal component analysis and long time protein dynamics. The Journal of Physical Chemistry, 100(7):2567–2572, 1996.

[238] M Tuckerman, Bruce J Berne, and Glenn J Martyna. Reversible multiple time scale molecular dynamics. The Journal of Chemical Physics, 97(3):1990–2001, 1992.

[239] P Minary, M Tuckerman, and GJ Martyna. Long time molecular dynamics for enhanced conformational sampling in biomolecular systems. Physical Review Letters, 93(15):150201, 2004.

[240] James B Anderson. Predicting rare events in molecular dynamics. Advances in Chemical Physics, 91:381–432, 1995.

[241] Alessandro Barducci, Massimiliano Bonomi, and Michele Parrinello. Metadynamics. Wiley Interdisciplinary Reviews: Computational Molecular Science, 1(5):826–843, 2011.

[242] Robert E Bruccoleri and Martin Karplus. Conformational sampling using high-temperature molecular dynamics. Biopolymers, 29(14):1847–1862, 1990.

[243] Cameron F Abrams and Eric Vanden-Eijnden. Large-scale conformational sampling of proteins using temperature-accelerated molecular dynamics. Proceedings of the National Academy of Sciences, 107(11):4961–4966, 2010.

[244] Robert E Bruccoleri and Martin Karplus. Prediction of the folding of short polypeptide segments by uniform conformational sampling. Biopolymers, 26(1):137–168, 1987.

[245] Istvan Kolossvary and Wayne C Guida. Low mode search. An efficient, automated computational method for conformational analysis: Application to cyclic and acyclic alkanes and cyclic peptides. Journal of the American Chemical Society, 118(21):5011–5019, 1996.

[246] Istvan Kolossvary and Wayne C Guida. Low-mode conformational search elucidated: Application to C39H80 and flexible docking of 9-deazaguanine inhibitors into PNP. Journal of Computational Chemistry, 20(15):1671–1684, 1999.

[247] DP Landau, Shan-Ho Tsai, and M Exler. A new approach to Monte Carlo simulations in statistical physics: Wang-Landau sampling. American Journal of Physics, 72(10):1294–1302, 2004.

[248] John Skilling et al. Nested sampling for general Bayesian computation. Bayesian Analysis, 1(4):833–859, 2006.

[249] Nitin Rathore and Juan J de Pablo. Monte Carlo simulation of proteins through a random walk in energy space. The Journal of Chemical Physics, 116(16):7225–7230, 2002.

[250] Qiliang Yan, Roland Faller, and Juan J de Pablo. Density-of-states Monte Carlo method for simulation of fluids. The Journal of Chemical Physics, 116(20):8745–8749, 2002.

[251] Evelina B Kim, Roland Faller, Qiliang Yan, Nicholas L Abbott, and Juan J de Pablo. Potential of mean force between a spherical particle suspended in a nematic liquid crystal and a substrate. The Journal of Chemical Physics, 117(16):7781–7787, 2002.

[252] Michael Patra and Mikko Karttunen. Systematic comparison of force fields for microscopic simulations of NaCl in aqueous solutions: Diffusion, free energy of hydration, and structural properties. Journal of Computational Chemistry, 25(5):678–689, 2004.

[253] Sarah Rauscher, Vytautas Gapsys, Michal J Gajda, Markus Zweckstetter, Bert L de Groot, and Helmut Grubmüller. Structural ensembles of intrinsically disordered proteins depend strongly on force field: a comparison to experiment. Journal of Chemical Theory and Computation, 11(11):5513–5524, 2015.

[254] Aaron Sayvetz. The kinetic energy of polyatomic molecules. Journal of Chemical Physics, 7(6):383–389, 1939.

[255] Carl Eckart. Some studies concerning rotating axes and polyatomic molecules. Physical Review, 47(7):552, 1935.

[256] Joseph Ochterski. Vibrational analysis in Gaussian, October 1999.

[257] Meredith JT Jordan, Keiran C Thompson, and Michael A Collins. Convergence of molecular potential energy surfaces by interpolation: Application to the OH + H2 → H2O + H reaction. Journal of Chemical Physics, 102(14):5647–5657, 1995.

[258] HF Busnengo, A Salin, and W Dong. Representation of the 6D potential energy surface for a diatomic molecule near a solid surface. Journal of Chemical Physics, 112(17):7641–7651, 2000.

[259] C Crespos, MA Collins, E Pijper, and GJ Kroes. Application of the modified Shepard interpolation method to the determination of the potential energy surface for a molecule–surface reaction: H2 + Pt(111). Journal of Chemical Physics, 120(5):2392–2404, 2004.

[260] J Toulouse, R Assaraf, and CJ Umrigar. Introduction to the variational and diffusion Monte Carlo methods. arXiv:1508.02989, 2015.

[261] Robert G Gallager. Discrete stochastic processes, volume 321. Springer Science & Business Media, 2012.

[262] WA Vigor, JS Spencer, MJ Bearpark, and AJW Thom. Minimising biases in full configuration interaction quantum Monte Carlo. Journal of Chemical Physics, 142(10):104101, 2015.

[263] Yao-Yuan Chuang and Donald G Truhlar. Reaction-path dynamics in redundant internal coordinates. Journal of Physical Chemistry A, 102(1):242–247, 1998.

[264] Javier Cerezo, Jose Zuniga, Alberto Requena, Francisco J Avila Ferrer, and Fabrizio Santoro. Harmonic models in Cartesian and internal coordinates to simulate the absorption spectra of carotenoids at finite temperatures. Journal of Chemical Theory and Computation, 9(11):4947–4958, 2013.

[265] Jeffrey R Reimers. A practical method for the use of curvilinear coordinates in calculations of normal-mode-projected displacements and Duschinsky rotation matrices for large molecules. The Journal of Chemical Physics, 115(20):9103–9109, 2001.

[266] Kai Brandhorst and Jorg Grunenberg. Efficient computation of compliance matrices in redundant internal coordinates from Cartesian Hessians for non-stationary points. The Journal of Chemical Physics, 132(18):184101, 2010.

[267] David J Wales. Potential energy surfaces and coordinate dependence. The Journal of Chemical Physics, 113(9):3926–3927, 2000.

[268] Vebjørn Bakken and Trygve Helgaker. The efficient optimization of molecular geometries using redundant internal coordinates. Journal of Chemical Physics, 117(20):9160–9174, 2002.

[269] E Bright Wilson Jr. Some mathematical methods for the study of molecular vibrations. Journal of Chemical Physics, 9(1):76–84, 1941.

[270] DF McIntosh, KH Michaelian, and MR Peterson. A consistent derivation of the Wilson-Decius s vectors, including new out-of-plane wag formulae. Canadian Journal of Chemistry, 56(9):1289–1295, 1978.

[271] Jon R Maple, M-J Hwang, Thomas P. Stockfisch, Uri Dinur, Marvin Waldman, Carl S Ewig, and Arnold T. Hagler. Derivation of class II force fields. I. Methodology and quantum force field for the alkyl functional group and alkane molecules. Journal of Computational Chemistry, 15(2):162–182, 1994.

[272] Carl S Ewig, Rajiv Berry, Uri Dinur, Jörg-Rüdiger Hill, Ming-Jing Hwang, Haiying Li, Chris Liang, Jon Maple, Zhengwei Peng, Thomas P Stockfisch, et al. Derivation of class II force fields. VIII. Derivation of a general quantum mechanical force field for organic compounds. Journal of Computational Chemistry, 22(15):1782–1800, 2001.

[273] JR Maple, Ming-Jing Hwang, Karl James Jalkanen, Thomas P Stockfisch, and Arnold T Hagler. Derivation of class II force fields: V. Quantum force field for amides, peptides, and related compounds. Journal of Computational Chemistry, 19(4):430–458, 1998.

[274] M-J Hwang, X Ni, M Waldman, CS Ewig, and AT Hagler. Derivation of class II force fields. VI. Carbohydrate compounds and anomeric effects. Biopolymers, 45(6):435–468, 1998.

[275] Zhengwei Peng, Carl S Ewig, Ming-Jing Hwang, Marvin Waldman, and Arnold T Hagler. Derivation of class II force fields. 4. Van der Waals parameters of alkali metal cations and halide anions. Journal of Physical Chemistry A, 101(39):7243–7252, 1997.

[276] James McDonagh. hermes. https://github.com/Jammyzx1/Hermes, 2016.

[277] Yongna Yuan, Matthew JL Mills, Paul LA Popelier, and Frank Jensen. Comprehensive analysis of energy minima of the 20 natural amino acids. Journal of Physical Chemistry A, 118(36):7876–7891, 2014.

[278] Florence Tama, Florent Xavier Gadea, Osni Marques, and Yves-Henri Sanejouand. Building-block approach for determining low-frequency normal modes of macromolecules. Proteins: Structure, Function, and Bioinformatics, 41(1):1–7, 2000.

[279] B Roux and M Karplus. The normal modes of the gramicidin-a dimer channel. Biophysical Journal, 53(3):297, 1988.

[280] Bernard Brooks and Martin Karplus. Normal modes for specific motions of macromolecules: application to the hinge-bending mode of lysozyme. Proceedings of the National Academy of Sciences, 82(15):4995–4999, 1985.

[281] Qiang Cui, Guohui Li, Jianpeng Ma, and Martin Karplus. A normal mode analysis of structural plasticity in the biomolecular motor f1-atpase. Journal of Molecular Biology, 340(2):345–372, 2004.

[282] Jianpeng Ma and Martin Karplus. Ligand-induced conformational changes in ras p21: a normal mode and energy minimization analysis. Journal of Molecular Biology, 274(1):114–131, 1997.

[283] Alexander S Davydov. Solitons and energy transfer along protein molecules. Journal of Theoretical Biology, 66(2):379–387, 1977.

[284] Aleksandr Sergeevich Davydov. Solitons in molecular systems. Physica Scripta, 20(3-4):387, 1979.

[285] Aleksandr Sergeevich Davydov. Solitons in molecular systems. In Nonlinear and Turbulent Processes in Physics, volume 1, page 731, 1984.

[286] D Sali, Mark Bycroft, and Alan R Fersht. Stabilization of protein structure by interaction of alpha-helix dipole with a charged side chain. Nature, 335(6192):740–743, 1988.

[287] H Nicholson, WJ Becktel, and BW Matthews. Enhanced protein thermostability from designed mutations that interact with alpha-helix dipoles. Nature, 336:651–656, 1988.

[288] Victor Munoz, Francisco J Blanco, and Luis Serrano. The hydrophobic-staple motif and a role for loop-residues in α-helix stability and protein folding. Nature Structural & Molecular Biology, 2(5):380–385, 1995.

[289] Carmen Herrmann, Kenneth Ruud, and Markus Reiher. Can raman optical activity separate axial from local chirality? a theoretical study of helical deca-alanine. ChemPhysChem, 7(10):2189–2196, 2006.

[290] Frank M DiCapua, S Swaminathan, and David L Beveridge. Theoretical evidence for water insertion in α-helix bending: molecular dynamics of gly30 and ala30 in vacuo and in solution. Journal of the American Chemical Society, 113(16):6145–6155, 1991.

[291] Adam B Birkholz and H Bernhard Schlegel. Exploration of some refinements to geometry optimization methods. Theoretical Chemistry Accounts, 135(4):1–12, 2016.

[292] Mari L DeMarco and Robert J Woods. Structural glycobiology: a game of snakes and ladders. Glycobiology, 18(6):426–440, 2008.

[293] Eusebio Juaristi and Gabriel Cuevas. The anomeric effect. CRC press, 1994.

[294] B Lachele Foley, Matthew B Tessier, and Robert J Woods. Carbohydrate force fields. Wiley Interdisciplinary Reviews: Computational Molecular Science, 2(4):652–697, 2012.

[295] VS Raghavendra Rao. Conformation of carbohydrates. CRC Press, 1998.

[296] Karl N Kirschner and Robert J Woods. Solvent interactions determine carbohydrate conformation. Proceedings of the National Academy of Sciences, 98(19):10541–10545, 2001.

[297] Elisa Fadda and Robert J Woods. Molecular simulations of carbohydrates and protein–carbohydrate interactions: motivation, issues and prospects. Drug Discovery Today, 15(15):596–609, 2010.

[298] Jakub Kaminsky, Josef Kapitan, Vladimir Baumruk, Lucie Bednarova, and Petr Bour. Interpretation of raman and raman optical activity spectra of a flexible sugar derivative, the gluconic acid anion. Journal of Physical Chemistry A, 113(15):3594–3601, 2009.

[299] James R Cheeseman, Majeed S Shaik, Paul LA Popelier, and Ewan W Blanch. Calculation of raman optical activity spectra of methyl-β-d-glucose incorporating a full molecular dynamics simulation of hydration effects. Journal of the American Chemical Society, 133(13):4991–4997, 2011.

[300] Karl N Kirschner, Austin B Yongye, Sarah M Tschampel, Jorge González-Outeiriño, Charlisa R Daniels, B Lachele Foley, and Robert J Woods. Glycam06: a generalizable biomolecular force field. carbohydrates. Journal of Computational Chemistry, 29(4):622–655, 2008.

[301] Ilian T Todorov, William Smith, Kostya Trachenko, and Martin T Dove. Dl poly 3: new dimensions in molecular dynamics simulations via massive parallelism. Journal of Materials Chemistry, 16(20):1911–1918, 2006.

[302] Amanda M Salisburg, Ashley L Deline, Katrina W Lexa, George C Shields, and Karl N Kirschner. Ramachandran-type plots for glycosidic linkages: Examples from molecular dynamic simulations using the glycam06 force field. Journal of Computational Chemistry, 30(6):910–921, 2009.

[303] Mari L DeMarco and Robert J Woods. Atomic-resolution conformational analysis of the gm3 ganglioside in a lipid bilayer and its implications for ganglioside–protein recognition at membrane surfaces. Glycobiology, 19(4):344–355, 2009.

[304] Mari L DeMarco and Robert J Woods. From agonist to antagonist: Structure and dynamics of innate immune glycoprotein md-2 upon recognition of variably acylated bacterial endotoxins. Molecular Immunology, 49(1):124–133, 2011.

[305] Matthew B Tessier, Mari L DeMarco, Austin B Yongye, and Robert J Woods. Extension of the glycam06 biomolecular force field to lipids, lipid bilayers and glycolipids. Molecular Simulation, 34(4):349–364, 2008.

[306] Maral Basma, S Sundara, Dilek Calgan, Tereza Vernali, and Robert J Woods. Solvated ensemble averaging in the calculation of partial atomic charges. Journal of Computational Chemistry, 22(11):1125–1137, 2001.

[307] Salvatore Cardamone, Timothy J Hughes, and Paul LA Popelier. Multipolar electrostatics. Physical Chemistry Chemical Physics, 16(22):10367–10387, 2014.

[308] Ibon Alkorta and Paul LA Popelier. Computational study of mutarotation in erythrose and threose. Carbohydrate research, 346(18):2933–2939, 2011.

[309] Luis Miguel Azofra, Ibon Alkorta, José Elguero, and Paul LA Popelier. Conformational study of the open-chain and furanose structures of d-erythrose and d-threose. Carbohydrate research, 358:96–105, 2012.

[310] MJ Frisch, GW Trucks, HB Schlegel, GE Scuseria, MA Robb, JR Cheeseman, JA Montgomery Jr, T Vreven, KN Kudin, JC Burant, et al. Gaussian 03, revision C.02. Gaussian, Inc., Wallingford, CT, 2004.

[311] Frank Jensen. Polarization consistent basis sets. iii. the importance of diffuse functions. The Journal of chemical physics, 117(20):9234–9240, 2002.

[312] TA Keith. Aimall (version 11.04.03). TK Gristmill Software, 2011.

[313] Paul LA Popelier. Qctff: On the construction of a novel protein force field. International Journal of Quantum Chemistry, 115(16):1005–1011, 2015.

[314] HA Boateng and IT Todorov. Arbitrary order permanent cartesian multipolar electrostatic interactions. The Journal of chemical physics, 142(3):034117, 2015.

[315] Michael A Collins. Molecular potential-energy surfaces for chemical reaction dynamics. Theoretical Chemistry Accounts, 108(6):313–324, 2002.

[316] Michael R Garey, David S. Johnson, and R Endre Tarjan. The planar hamiltonian circuit problem is np-complete. SIAM Journal on Computing, 5(4):704–714, 1976.

[317] Ralph E Steuer. Multiple Criteria Optimization: Theory, Computation, and Applications. Wiley, 1986.

[318] SS Ravi, Daniel J Rosenkrantz, and Giri Kumar Tayi. Facility dispersion problems: Heuristics and special cases. In Algorithms and Data Structures, pages 355–366. Springer, 1991.

[319] Gijs Rennen. Subset selection from large datasets for kriging modeling. Structural and Multidisciplinary Optimization, 38(6):545–569, 2009.

[320] Jianming Shi and Yamamoto Yoshitsugu. A dc approach to the largest empty sphere problem in higher dimensions. In State of the Art in Global Optimization, pages 395–411. Springer, 1996.

[321] Godfried T Toussaint. Computing largest empty circles with location constraints. International Journal of Computer & Information Sciences, 12(5):347–358, 1983.

[322] C Bradford Barber, David P Dobkin, and Hannu Huhdanpaa. The quickhull algorithm for convex hulls. ACM Transactions on Mathematical Software (TOMS), 22(4):469–483, 1996.

[323] David F Watson. Computing the n-dimensional delaunay tessellation with application to voronoi polytopes. The Computer Journal, 24(2):167–172, 1981.

[324] Tatiana I Netzeva, Andrew P Worth, Tom Aldenberg, Romualdo Benigni, Mark TD Cronin, Paola Gramatica, Joanna S Jaworska, Scott Kahn, Gilles Klopman, Carol A Marchant, et al. Current status of methods for defining the applicability domain of (quantitative) structure-activity relationships. ATLA, 33:155–173, 2005.

[325] Alexander Tropsha. Variable selection qsar modeling, model validation, and virtual screening. Annual Reports in Computational Chemistry, 2:113–126, 2006.

[326] Weida Tong, Qian Xie, Huixiao Hong, Leming Shi, Hong Fang, and Roger Perkins. Assessment of prediction confidence and domain extrapolation of two structure-activity relationship models for predicting estrogen receptor binding activity. Environmental Health Perspectives, pages 1249–1254, 2004.

[327] Faizan Sahigara, Kamel Mansouri, Davide Ballabio, Andrea Mauri, Viviana Consonni, and Roberto Todeschini. Comparison of different approaches to define the applicability domain of qsar models. Molecules, 17(5):4791–4810, 2012.

[328] Pierre Bruneau and Nathan R McElroy. Generalized fragment substructure-based property prediction method. Journal of Chemical Information Modelling, 46:1379–1387, 2006.

[329] Pierre Bruneau and Nathan R McElroy. logd 7.4 modeling using bayesian regularized neural networks. assessment and correction of the errors of prediction. Journal of Chemical Information Modelling, 46(3):1379–1387, 2006.

[330] Prasanta Chandra Mahalanobis. On the generalized distance in statistics. Proceedings of the National Institute of Sciences (Calcutta), 2:49–55, 1936.

[331] Robert P Sheridan, Bradley P Feuston, Vladimir N Maiorov, and Simon K Kearsley. Similarity to molecules in the training set is a good discriminator for prediction accuracy in qsar. Journal of Chemical Information and Computer Sciences, 44(6):1912–1928, 2004.

[332] Timon Sebastian Schroeter, Anton Schwaighofer, Sebastian Mika, Antonius Ter Laak, Detlev Suelzle, Ursula Ganzer, Nikolaus Heinrich, and Klaus-Robert Müller. Estimating the domain of applicability for machine learning qsar models: A study on aqueous solubility of drug discovery molecules. Journal of Computer-Aided Molecular Design, 21(12):651–664, 2007.

[333] Bernard W Silverman. Density Estimation for Statistics and Data Analysis, volume 26. CRC press, 1986.

[334] Weida Tong, Huixiao Hong, Hong Fang, Qian Xie, and Roger Perkins. Decision forest: combining the predictions of multiple independent decision tree models. Journal of Chemical Information and Computer Sciences, 43(2):525–531, 2003.

[335] Sabcho Dimitrov, Gergana Dimitrova, Todor Pavlov, Nadezhda Dimitrova, Grace Patlewicz, Jay Niemela, and Ovanes Mekenyan. A stepwise approach for defining the applicability domain of sar and qsar models. Journal of Chemical Information and Modeling, 45(4):839–849, 2005.

[336] Joanna Jaworska, Nina Nikolova-Jeliazkova, and Tom Aldenberg. Qsar applicability domain estimation by projection of the training set descriptor space: A review. ATLA, 33(5):445, 2005.

[337] WS Massey. Cross products of vectors in higher dimensional euclidean spaces. The American Mathematical Monthly, 90(10):697–701, 1983.

[338] PW Atkins and LD Barron. Rayleigh scattering of polarized photons by molecules. Molecular Physics, 16(5):453–466, 1969.

[339] LD Barron and AD Buckingham. Rayleigh and raman scattering from optically active molecules. Molecular Physics, 20(6):1111–1119, 1971.

[340] Lawrence D Barron. Raman optical activity of camphor and related molecules. Journal of the Chemical Society, Perkin Transactions 2, 2(8):1074–1079, 1977.

[341] LD Barron and AD Buckingham. The inertial contribution to vibrational optical activity in methyl torsion modes. Journal of the American Chemical Society, 101(8):1979–1987, 1979.

[342] Laurence D Barron and Julian Vrbancich. Temperature dependence of the low-frequency vibrational optical activity of α-phellandrene. Journal of the Chemical Society, Chemical Communications, 1(15):771–772, 1981.

[343] Lutz Hecht, Anthony L Phillips, and Laurence D Barron. Determination of enantiomeric excess using raman optical activity. Journal of Raman Spectroscopy, 26(8-9):727–732, 1995.

[344] Ewan W Blanch, Lutz Hecht, and Laurence D Barron. Vibrational raman optical activity of proteins, nucleic acids, and viruses. Methods, 29(2):196–209, 2003.

[345] AF Bell, L Hecht, and LD Barron. Vibrational raman optical activity of dna and rna. Journal of the American Chemical Society, 120(23):5820–5821, 1998.

[346] LD Barron, L Hecht, EW Blanch, and AF Bell. Solution structure and dynamics of biomolecules from raman optical activity. Progress in Biophysics and Molecular Biology, 73(1):1–49, 2000.

[347] Zhengshuang Shi, Robert W Woody, and Neville R Kallenbach. Is polyproline ii a major backbone conformation in unfolded proteins? Advances in Protein Chemistry, 62:163–240, 2002.

[348] Ewan W Blanch, Lutz Hecht, Christopher D Syme, Vito Volpetti, George P Lomonossoff, Kurt Nielsen, and Laurence D Barron. Molecular structures of viruses from raman optical activity. Journal of General Virology, 83(10):2593–2600, 2002.

[349] Lutz Hecht, Laurence D Barron, Ewan W Blanch, Alasdair F Bell, and Loren A Day. Raman optical activity instrument for studies of biopolymer structure and dynamics. Journal of Raman Spectroscopy, 30(9):815–825, 1999.

[350] Chien-Shiung Wu, Ernest Ambler, RW Hayward, DD Hoppes, and RP Hudson. Experimental test of parity conservation in beta decay. Physical Review, 105(4):1413, 1957.

[351] LD Barron. Parity and optical activity. Nature, 238(5358):17, 1972.

[352] LD Barron and CG Gray. The multipole interaction hamiltonian for time dependent fields. Journal of Physics A: Mathematical, Nuclear and General, 6(1):59, 1973.

[353] Max Born and Kun Huang. Dynamical theory of crystal lattices. Clarendon Press, 1954.

[354] G Placzek. Handbuch der radiologie. VI, Leipzig I, 932, 1934.

[355] LA Nafie and Diping Che. Theory and measurement of raman optical activity. Advances in chemical physics, 85:105–149, 1994.

[356] PL Polavarapu. Ab initio vibrational raman and raman optical activity spectra. Journal of Physical Chemistry, 94(21):8106–8112, 1990.

[357] Garnet Kin-Lic Chan and Sandeep Sharma. The density matrix renormalization group in quantum chemistry. Annual Review of Physical Chemistry, 62:465–481, 2011.

[358] Laurence A Nafie. Theory of raman scattering and raman optical activity: near resonance theory and levels of approximation. Theoretical Chemistry Accounts, 119(1-3):39–55, 2008.

[359] J Tang and AC Albrecht. Developments in the theories of vibrational raman intensities. In Raman Spectroscopy, pages 33–68. Springer, 1970.

[360] Jaroslav Sebestik and Petr Bour. Raman optical activity of methyloxirane gas and liquid. The Journal of Physical Chemistry Letters, 2(5):498–502, 2011.

[361] Sandra Luber, Johannes Neugebauer, and Markus Reiher. Intensity tracking for theoretical infrared spectroscopy of large molecules. The Journal of Chemical Physics, 130(6):064105, 2009.

[362] Sandra Luber and Markus Reiher. Intensity-carrying modes in raman and raman optical activity spectroscopy. ChemPhysChem, 10(12):2049–2057, 2009.

[363] Andrew Komornicki and James W McIver Jr. An efficient ab initio method for computing infrared and raman intensities: Application to ethylene. The Journal of Chemical Physics, 70(4):2014–2016, 1979.

[364] Michael J Frisch, Yukio Yamaguchi, Jeffrey F Gaw, Henry F Schaefer III, and J Stephen Binkley. Analytic raman intensities from molecular electronic wave functions. The Journal of Chemical Physics, 84(1):531–532, 1986.

[365] Roger D Amos. Electric and magnetic properties of co, hf, hcl, and ch3f. Chemical Physics Letters, 87(1):23–26, 1982.

[366] Kenneth Ruud, Trygve Helgaker, and Petr Bour. Gauge-origin independent density-functional theory calculations of vibrational raman optical activity. The Journal of Physical Chemistry A, 106(32):7448–7455, 2002.

[367] Trygve Helgaker, Kenneth Ruud, Keld L Bak, Poul Jørgensen, and Jeppe Olsen. Vibrational raman optical activity calculations using london atomic orbitals. Faraday Discussions, 99:165–180, 1994.

[368] Vincent Liégeois, Kenneth Ruud, and Benoît Champagne. An analytical derivative procedure for the calculation of vibrational raman optical activity spectra. The Journal of Chemical Physics, 127(20):204105, 2007.

[369] Andreas J Thorvaldsen, Bin Gao, Kenneth Ruud, Maxim Fedorovsky, Gérard Zuber, and Werner Hug. Efficient calculation of roa tensors with analytical gradients and fragmentation. Chirality, 24(12):1018, 2012.

[370] Karl J Jalkanen, RM Nieminen, M Knapp-Mohammady, and S Suhai. Vibrational analysis of various isotopomers of l-alanyl-l-alanine in aqueous solution: Vibrational absorption, vibrational circular dichroism, raman, and raman optical activity spectra. International Journal of Quantum Chemistry, 92(2):239–259, 2003.

[371] Emadeddin Tajkhorshid, KJ Jalkanen, and Sandor Suhai. Structure and vibrational spectra of the zwitterion l-alanine in the presence of explicit water molecules: a density functional analysis. The Journal of Physical Chemistry B, 102(30):5899–5913, 1998.

[372] Christopher J Cramer and Donald G Truhlar. Implicit solvation models: equilibria, structure, spectra, and dynamics. Chemical Reviews, 99(8):2161–2200, 1999.

[373] Vincenzo Barone and Maurizio Cossi. Quantum calculation of molecular energies and energy gradients in solution by a conductor solvent model. The Journal of Physical Chemistry A, 102(11):1995–2001, 1998.

[374] KJ Jalkanen, IM Degtyarenko, RM Nieminen, X Cao, LA Nafie, F Zhu, and LD Barron. Role of hydration in determining the structure and vibrational spectra of l-alanine and n-acetyl l-alanine n-methylamide in aqueous solution: a combined theoretical and experimental approach. Theoretical Chemistry Accounts, 119(1-3):191–210, 2008.

[375] Sandra Luber and Markus Reiher. Calculated raman optical activity spectra of 1,6-anhydro-beta-d-glucopyranose. The Journal of Physical Chemistry A, 113(29):8268–8277, 2009.

[376] Kathrin H Hopmann, Kenneth Ruud, Magdalena Pecul, Andrzej Kudelski, Martin Dracinsky, and Petr Bour. Explicit versus implicit solvent modeling of raman optical activity spectra. The Journal of Physical Chemistry B, 115(14):4128–4137, 2011.

[377] Shaun T Mutter, François Zielinski, James R Cheeseman, Christian Johannessen, Paul LA Popelier, and Ewan W Blanch. Conformational dynamics of carbohydrates: Raman optical activity of d-glucuronic acid and n-acetyl-d-glucosamine using a combined molecular dynamics and quantum chemical approach. Physical Chemistry Chemical Physics, 17(8):6016–6027, 2015.

[378] François Zielinski, Shaun T Mutter, Christian Johannessen, Ewan W Blanch, and Paul LA Popelier. The raman optical activity of β-d-xylose: where experiment and computation meet. Physical Chemistry Chemical Physics, 17(34):21799–21809, 2015.

[379] Ramon Carbo, Luis Leyda, and Mariano Arnau. How similar is a molecule to another? an electron density measure of similarity between two molecular structures. International Journal of Quantum Chemistry, 17(6):1185–1189, 1980.

[380] Patrick Bultinck, Xavier Girones, and Ramon Carbo-Dorca. Molecular quantum similarity: theory and applications. Reviews in Computational Chemistry, 21:127, 2005.

[381] Jelle Vandenbussche, Patrick Bultinck, Anna K Przybyl, and Wouter A Herrebout. Statistical validation of absolute configuration assignment in vibrational optical activity. Journal of chemical theory and computation, 9(12):5504–5512, 2013.

[382] Elke Debie, Ewoud De Gussem, Rina K Dukor, Wouter Herrebout, Laurence A Nafie, and Patrick Bultinck. A confidence level algorithm for the determination of absolute configuration using vibrational circular dichroism or raman optical activity. ChemPhysChem, 12(8):1542–1549, 2011.

[383] Anthony P Scott and Leo Radom. Harmonic vibrational frequencies: an evaluation of hartree-fock, moller-plesset, quadratic configuration interaction, density functional theory, and semiempirical scale factors. Journal of Physical Chemistry, 100(41):16502–16513, 1996.

[384] Shigeki Yamamoto, Hitoshi Watarai, and Petr Bouř. Monitoring the backbone conformation of valinomycin by raman optical activity. ChemPhysChem, 12(8):1509–1518, 2011.

[385] Sandra Luber and Markus Reiher. Theoretical raman optical activity study of the β domain of rat metallothionein. The Journal of Physical Chemistry B, 114(2):1057–1063, 2009.

[386] Petr Bour, Jana Sopkova, Lucie Bednarova, Petr Malon, and Timothy A Keiderling. Transfer of molecular property tensors in cartesian coordinates: a new algorithm for simulation of vibrational spectra. Journal of Computational Chemistry, 18(5):646–659, 1997.

[387] Valery Andrushchenko and Petr Bour. Applications of the cartesian coordinate tensor transfer technique in the simulations of vibrational circular dichroism spectra of oligonucleotides. Chirality, 22(1E):E96–E114, 2010.

[388] Jan Kubelka. Vibrational spectroscopic studies of peptide and protein structures: Theory and experiment. PhD thesis, The University of Illinois at Chicago, 2002.

[389] Shigeki Yamamoto, Jakub Kaminsky, and Petr Bour. Structure and vibrational motion of insulin from raman optical activity spectra. Analytical Chemistry, 84(5):2440–2451, 2012.

[390] Noah S Bieler, Moritz P Haag, Christoph R Jacob, and Markus Reiher. Analysis of the cartesian tensor transfer method for calculating vibrational spectra of polypeptides. Journal of Chemical Theory and Computation, 7(6):1867–1881, 2011.

[391] Shigeki Yamamoto and Petr Bour. On the limited precision of transfer of molecular optical activity tensors. Collection of Czechoslovak Chemical Communications, 76(5):567–583, 2011.

[392] Shigeki Yamamoto, Xiaojun Li, Kenneth Ruud, and Petr Bour. Transferability of various molecular property tensors in vibrational spectroscopy. Journal of Chemical Theory and Computation, 8(3):977–985, 2012.

[393] Jiri Kessler, Josef Kapitán, and Petr Bour. First-principles predictions of vibrational raman optical activity of globular proteins. Journal of Physical Chemistry Letters, 6(16):3314–3319, 2015.

[394] Matthew J Betts and Robert B Russell. Amino acid properties and consequences of substitutions. Bioinformatics for Geneticists, 317:289, 2003.

[395] Cassandra DM Churchill and Stacey D Wetmore. Noncovalent interactions involving histidine: the effect of charge on π–π stacking and t-shaped interactions with the dna nucleobases. The Journal of Physical Chemistry B, 113(49):16046–16058, 2009.

[396] Paul Carter and James A Wells. Dissecting the catalytic triad of a serine protease. Nature, 332(6164):564–568, 1988.

[397] Sergio Manzetti, Daniel R McCulloch, Adrian C Herington, and David van der Spoel. Modeling of enzyme–substrate complexes for the metalloproteases mmp-3, adam-9 and adam-10. Journal of Computer-Aided Molecular Design, 17(9):551–565, 2003.

[398] Stefania Nobili, Enrico Mini, Ida Landini, Chiara Gabbiani, Angela Casini, and Luigi Messori. Gold compounds as anticancer agents: chemistry, cellular pharmacology, and preclinical studies. Medicinal Research Reviews, 30(3):550–580, 2010.

[399] George B Arfken, Hans J Weber, and Frank E Harris. Mathematical Methods for Physicists: A Comprehensive Guide. Academic Press, 2011.

[400] Wolfgang Kabsch. A solution for the best rotation to relate two sets of vectors. Acta Crystallographica Section A: Crystal Physics, Diffraction, Theoretical and General Crystallography, 32(5):922–923, 1976.

[401] Wolfgang Kabsch. A discussion of the solution for the best rotation to relate two sets of vectors. Acta Crystallographica Section A: Crystal Physics, Diffraction, Theoretical and General Crystallography, 34(5):827–828, 1978.

[402] Evangelos A Coutsias, Chaok Seok, and Ken A Dill. Using quaternions to calculate rmsd. Journal of Computational Chemistry, 25(15):1849–1857, 2004.

[403] Stefan Wallin, Jochen Farwer, and Ugo Bastolla. Testing similarity measures with continuous and discrete protein models. Proteins: Structure, Function, and Bioinformatics, 50(1):144–157, 2003.

[404] Berthold KP Horn, Hugh M Hilden, and Shahriar Negahdaripour. Closed-form solution of absolute orientation using orthonormal matrices. JOSA A, 5(7):1127–1135, 1988.

[405] Berk Hess, Carsten Kutzner, David Van Der Spoel, and Erik Lindahl. Gromacs 4: algorithms for highly efficient, load-balanced, and scalable molecular simulation. Journal of Chemical Theory and Computation, 4(3):435–447, 2008.

[406] George A Kaminski, Richard A Friesner, Julian Tirado-Rives, and William L Jorgensen. Evaluation and reparametrization of the opls-aa force field for proteins via comparison with accurate quantum chemical calculations on peptides. The Journal of Physical Chemistry B, 105(28):6474–6487, 2001.

[407] Evelyne Deplazes, Wilhelm Van Bronswijk, F Zhu, LD Barron, S Ma, LA Nafie, and KJ Jalkanen. A combined theoretical and experimental study of the structure and vibrational absorption, vibrational circular dichroism, raman and raman optical activity spectra of the l-histidine zwitterion. Theoretical Chemistry Accounts, 119(1-3):155–176, 2008.

[408] Zhijian Huang, Wenbo Yu, and Zijing Lin. First-principle studies of gaseous aromatic amino acid histidine. Journal of Molecular Structure: THEOCHEM, 801(1):7–20, 2006.

[409] Thom Vreven, K Suzie Byun, Istvan Komaromi, Stefan Dapprich, John A Montgomery Jr, Keiji Morokuma, and Michael J Frisch. Combining quantum mechanics methods with molecular mechanics methods in oniom. Journal of Chemical Theory and Computation, 2(3):815–826, 2006.

[410] Kenneth Ruud and Andreas J Thorvaldsen. Theoretical approaches to the calculation of raman optical activity spectra. Chirality, 21(1E):E54–E67, 2009.

[411] Robert Ditchfield. Self-consistent perturbation theory of diamagnetism: I. a gauge-invariant lcao method for nmr chemical shifts. Molecular Physics, 27(4):789–807, 1974.

[412] James R Cheeseman and Michael J Frisch. Basis set dependence of vibrational raman and raman optical activity intensities. Journal of Chemical Theory and Computation, 7(10):3323–3334, 2011.

[413] Gerard Zuber and Werner Hug. Rarefied basis sets for the calculation of optical tensors. 1. the importance of gradients on hydrogen atoms for the raman scattering tensor. Journal of Physical Chemistry A, 108(11):2108–2118, 2004.

[414] Vebjorn Bakken, John M Millam, and H Bernhard Schlegel. Ab initio classical trajectories on the born–oppenheimer surface: Updating methods for hessian-based integrators. The Journal of Chemical Physics, 111(19):8773–8777, 1999.

[415] Chunyang Peng, Philippe Y Ayala, H Bernhard Schlegel, and Michael J Frisch. Using redundant internal coordinates to optimize equilibrium geometries and transition states. Journal of Computational Chemistry, 17(1):49–56, 1996.

[416] Suk Kyoung Lee, Wen Li, and H Bernhard Schlegel. Hco+ dissociation in a strong laser field: An ab initio classical trajectory study. Chemical Physics Letters, 536:14–18, 2012.

[417] Adam B Birkholz and H Bernhard Schlegel. Using bonding to guide transition state optimization. Journal of Computational Chemistry, 36(15):1157–1166, 2015.

[418] Ikuo Ashikawa and Koichi Itoh. Raman spectra of polypeptides containing l-histidine residues and tautomerism of imidazole side chain. Biopolymers, 18(8):1859–1876, 1979.

[419] Akira Toyama, Kunio Ono, Shinji Hashimoto, and Hideo Takeuchi. Raman spectra and normal coordinate analysis of the n1-h and n3-h tautomers of 4-methylimidazole: vibrational modes of histidine tautomer markers. Journal of Physical Chemistry A, 106(14):3403–3412, 2002.

[420] Ol’ha O Brovarets, Roman O Zhurakivsky, and Dmytro M Hovorun. Is the dpt tautomerization of the long ag watson–crick dna base mispair a source of the adenine and guanine mutagenic tautomers? a qm and qtaim response to the biologically important question. Journal of computational chemistry, 35(6):451–466, 2014.

[421] Ol’ha O Brovarets and Dmytro M Hovorun. How does the long g-g* watson-crick dna base mispair comprising keto and enol tautomers of the guanine tautomerise? the results of a qm/qtaim investigation. Physical chemistry chemical physics: PCCP, 16(30):15886–15899, 2014.

[422] Ilian T Todorov, William Smith, Kostya Trachenko, and Martin T Dove. Dl poly 3: new dimensions in molecular dynamics simulations via massive parallelism. Journal of Materials Chemistry, 16(20):1911–1918, 2006.

[423] John A Baker and Jonathan D Hirst. Molecular dynamics simulations using graphics processing units. Molecular Informatics, 30(6-7):498–504, 2011.

[424] Sadaf R Alam, Pratul K Agarwal, Melissa C Smith, Jeffrey S Vetter, and David Caliga. Using fpga devices to accelerate biomolecular simulations. Computer, 40(3):66–73, 2007.

[425] David E Shaw, Martin M Deneroff, Ron O Dror, Jeffrey S Kuskin, Richard H Larson, John K Salmon, Cliff Young, Brannon Batson, Kevin J Bowers, Jack C Chao, et al. Anton, a special-purpose machine for molecular dynamics simulation. Communications of the ACM, 51(7):91–97, 2008.

[426] Noel Cressie. Spatial prediction and ordinary kriging. Mathematical geology, 20(4):405–421, 1988.

[427] Alexander IJ Forrester, Andy J Keane, and Neil W Bressloff. Design and analysis of “noisy” computer experiments. AIAA journal, 44(10):2331–2339, 2006.

[428] Carlos Henrique Grohmann and SS Steiner. Srtm resample with short distance-low nugget kriging. International Journal of Geographical Information Science, 22(8):895–906, 2008.

[429] Dale Zimmerman, Claire Pavlik, Amy Ruggles, and Marc P Armstrong. An experimental comparison of ordinary and universal kriging and inverse distance weighting. Mathematical Geology, 31(4):375–390, 1999.

[430] Timothy L Fletcher and Paul LA Popelier. Transferable kriging machine learning models for the multipolar electrostatics of helical deca-alanine. Theoretical Chemistry Accounts, 134(11):1–16, 2015.

[431] Adri CT Van Duin, Siddharth Dasgupta, Francois Lorant, and William A Goddard. Reaxff: a reactive force field for hydrocarbons. The Journal of Physical Chemistry A, 105(41):9396–9409, 2001.

[432] Peter Maxwell, Angel Martin Pendas, and Paul LA Popelier. Extension of the interacting quantum atoms (iqa) approach to b3lyp level density functional theory (dft). Physical Chemistry Chemical Physics, 2016.

[433] Graeme Henkelman, Andri Arnaldsson, and Hannes Jónsson. A fast and robust algorithm for bader decomposition of charge density. Computational Materials Science, 36(3):354–360, 2006.

[434] Ian Bytheway, Ronald J Gillespie, Ting-Hua Tang, and Richard FW Bader. Core distortions and geometries of the difluorides and dihydrides of ca, sr, and ba. Inorganic Chemistry, 34(9):2407–2414, 1995.

[435] Enrico Fermi. Un metodo statistico per la determinazione di alcune priorieta dell’atome. Rend. Accad. Naz. Lincei, 6(602-607):32, 1927.

[436] Elliott H Lieb and Barry Simon. The thomas-fermi theory of atoms, molecules and solids. Advances in Mathematics, 23(1):22–116, 1977.

[437] John Clarke Slater. The self-consistent field for molecules and solids, volume 4. McGraw-Hill New York, 1974.

[438] Erich Runge and Eberhard KU Gross. Density-functional theory for time-dependent systems. Physical Review Letters, 52(12):997, 1984.

[439] Pierre Hohenberg and Walter Kohn. Inhomogeneous electron gas. Physical review, 136(3B):B864, 1964.

[440] Mel Levy. Electron densities in search of hamiltonians. Physical Review A, 26(3):1200, 1982.

[441] Yan Zhao and Donald G Truhlar. The m06 suite of density functionals for main group thermochemistry, thermochemical kinetics, noncovalent inter- actions, excited states, and transition elements: two new functionals and systematic testing of four m06-class functionals and 12 other functionals. Theoretical Chemistry Accounts, 120(1-3):215–241, 2008. Appendix A

Density Functional Theory

A.1 Thomas-Fermi Model

The vast majority of (time-independent) ab initio methods use the molecular wavefunction of an $N$-electron system, $\Psi(\mathbf{r}_1, \ldots, \mathbf{r}_N)$, as a central quantity. When one has an appropriate wavefunction, expectation values of operators associated with physical observables can be evaluated. Rather than working with the wavefunction of a system, the Thomas-Fermi model[435] uses the electron density, $\rho(\mathbf{r})$, as a central quantity. The Thomas-Fermi energy of a system, $E_{TF}[\rho(\mathbf{r})]$, is a functional of the electron density, and can be written

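$$E_{TF}[\rho(\mathbf{r})] = C_F \int d\mathbf{r}\, \rho(\mathbf{r})^{5/3} + \int d\mathbf{r}\, \rho(\mathbf{r}) v(\mathbf{r}) + \frac{1}{2} \int d\mathbf{r}' \int d\mathbf{r}\, \frac{\rho(\mathbf{r})\rho(\mathbf{r}')}{\|\mathbf{r} - \mathbf{r}'\|} . \tag{A.1.1}$$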

The first term of this quantity corresponds to the kinetic energy of a homogeneous electron gas, the second term is the Coulombic attraction between electrons and nuclei and the third term is the Coulombic electron-electron repulsion. $v(\mathbf{r})$, the “external potential”, has analytical form

$$v(\mathbf{r}) = -\sum_{I=1}^{N_N} \frac{Z_I}{|\mathbf{r} - \mathbf{R}_I|} , \tag{A.1.2}$$

where $\mathbf{R}_I$ is the position vector of the $I$-th nucleus and $Z_I$ its atomic number. Note that we have treated the system classically, and so have no way to account for the quantum mechanical nature of the electrons, e.g. fermion exchange. We seek the electron density which minimises the Thomas-Fermi energy given in (A.1.1), subject to the constraint that the number of electrons remains constant, i.e. $\int d\mathbf{r}\, \rho(\mathbf{r}) = N_e$. The minimisation of a quantity subject to a constraint is readily achieved by use of a Lagrange multiplier, $\lambda$,

$$\frac{\delta E_{TF}[\rho(\mathbf{r})]}{\delta\rho(\mathbf{r})} - \lambda \frac{\delta}{\delta\rho(\mathbf{r})}\left[\int d\mathbf{r}\, \rho(\mathbf{r}) - N_e\right] = 0 . \tag{A.1.3}$$
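As a reminder of the individual steps, the three terms of (A.1.1) and the constraint can be differentiated one at a time. This is a sketch in the notation used above, with $\mathbf{r}''$ introduced here only as a dummy integration variable:

```latex
\begin{aligned}
\frac{\delta}{\delta\rho(\mathbf{r})}\, C_F \int d\mathbf{r}''\, \rho(\mathbf{r}'')^{5/3}
  &= \frac{5}{3}\, C_F\, \rho(\mathbf{r})^{2/3} , \\
\frac{\delta}{\delta\rho(\mathbf{r})} \int d\mathbf{r}''\, \rho(\mathbf{r}'')\, v(\mathbf{r}'')
  &= v(\mathbf{r}) , \\
\frac{\delta}{\delta\rho(\mathbf{r})}\, \frac{1}{2} \int d\mathbf{r}' \int d\mathbf{r}''\,
  \frac{\rho(\mathbf{r}'')\rho(\mathbf{r}')}{\|\mathbf{r}'' - \mathbf{r}'\|}
  &= \int d\mathbf{r}'\, \frac{\rho(\mathbf{r}')}{\|\mathbf{r} - \mathbf{r}'\|} , \\
\frac{\delta}{\delta\rho(\mathbf{r})} \left[ \int d\mathbf{r}''\, \rho(\mathbf{r}'') - N_e \right]
  &= 1 .
\end{aligned}
```

The factor of one half in the electron-electron term cancels because the density appears twice in the integrand and the Coulomb kernel is symmetric in its two arguments.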

Taking the functional derivative of each term in the Thomas-Fermi energy with respect to the electron density is straightforward, and combining with (A.1.3) yields

$$\frac{5}{3} C_F\, \rho(\mathbf{r})^{2/3} + v(\mathbf{r}) + \int d\mathbf{r}'\, \frac{\rho(\mathbf{r}')}{|\mathbf{r} - \mathbf{r}'|} - \lambda = 0 , \tag{A.1.4}$$

which is readily solved to find the electron density of the minimum energy state of the system. Unfortunately, the Thomas-Fermi formalism is a poor representation of a quantum mechanical system, owing to the classical description we have invoked. In addition, the approximation of the kinetic energy as in (A.1.1) is particularly crude. As a result, the Thomas-Fermi model fails to predict even qualitatively correct results[436]. For instance, the Thomas-Fermi model is unable to predict bonding between atoms, and consequently the majority of chemistry. Xα-theory[437] was a later correction to the Thomas-Fermi model, but was also largely of little use.

A.2 Hohenberg-Kohn Theorems

A.2.1 First Theorem

Consider a normalised wavefunction associated with a given external potential, $\psi_v$, which satisfies the Schrödinger equation

$$H_v \psi_v = E_v \psi_v , \tag{A.2.1}$$

where the Hamiltonian operator can be decomposed into a sum of kinetic and potential energy operators and the external potential, $H = T + U + v(\mathbf{r})$. We assume there are two potentials, $v_a$ and $v_b$, with an identical wavefunction (to within some phase factor, $e^{i\phi}$), such that

$$H_a \psi_a = E_a \psi_a , \qquad H_b \psi_b = E_b \psi_b , \qquad \psi_a = e^{i\phi} \psi_b . \tag{A.2.2}$$

Multiplying the second equation through by $e^{i\phi}$ and subtracting it from the first,

$$\left( H_a - H_b \right) \psi_a = \left( v_a(\mathbf{r}) - v_b(\mathbf{r}) \right) \psi_a = \left( E_a - E_b \right) \psi_a , \tag{A.2.3}$$

leaving us with the relation

$$v_a(\mathbf{r}) - v_b(\mathbf{r}) = E_a - E_b = c , \tag{A.2.4}$$

where $c$ is a constant. The above implies that wavefunctions differing by a phase factor have potentials that differ by a constant term,

$$\psi_a = e^{i\phi} \psi_b \iff v_a(\mathbf{r}) = v_b(\mathbf{r}) + c , \tag{A.2.5}$$

and the reverse holds (potentials differing by a constant term have wavefunctions differing only by a phase factor), hence the relationship is bijective.

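Now consider two wavefunctions, $\psi_a$ and $\psi_b$, associated with different potentials. By use of the variational principle,

$$\begin{aligned}
E_a &< \langle \psi_b | H_a | \psi_b \rangle = \langle \psi_b | T + U + v_a(\mathbf{r}) | \psi_b \rangle \\
&= \langle \psi_b | T + U | \psi_b \rangle + \int d\mathbf{r}\, \rho_b(\mathbf{r}) v_b(\mathbf{r}) - \int d\mathbf{r}\, \rho_b(\mathbf{r}) v_b(\mathbf{r}) + \int d\mathbf{r}\, \rho_b(\mathbf{r}) v_a(\mathbf{r}) \\
&= E_b + \int d\mathbf{r} \left[ v_a(\mathbf{r}) - v_b(\mathbf{r}) \right] \rho_b(\mathbf{r}) ,
\end{aligned} \tag{A.2.6}$$

and, interchanging the labels $a$ and $b$,

$$E_b < E_a + \int d\mathbf{r} \left[ v_b(\mathbf{r}) - v_a(\mathbf{r}) \right] \rho_a(\mathbf{r}) . \tag{A.2.7}$$

Adding the two inequalities, we arrive at

$$E_a + E_b < E_a + E_b + \int d\mathbf{r} \left[ v_a(\mathbf{r}) - v_b(\mathbf{r}) \right] \rho_b(\mathbf{r}) + \int d\mathbf{r} \left[ v_b(\mathbf{r}) - v_a(\mathbf{r}) \right] \rho_a(\mathbf{r}) , \tag{A.2.8}$$

which, if $\rho_a(\mathbf{r}) = \rho_b(\mathbf{r})$, leads to

$$E_a + E_b < E_a + E_b , \tag{A.2.9}$$

which is obviously absurd. As such, to satisfy the above, we require that two distinct potentials, up to an additive constant, lead to two distinct densities, i.e.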

$$\rho_a(\mathbf{r}) \neq \rho_b(\mathbf{r}) \iff v_a(\mathbf{r}) \neq v_b(\mathbf{r}) + c , \tag{A.2.10}$$

which is the Hohenberg-Kohn first theorem. As an aside, Runge and Gross[438] provided the rather extraordinary result that for the time-dependent case, the above only requires a slight modification to

$$\rho_a(\mathbf{r}, t) \neq \rho_b(\mathbf{r}, t) \iff v_a(\mathbf{r}, t) \neq v_b(\mathbf{r}, t) + c(t) . \tag{A.2.11}$$
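The time-independent statement (A.2.10) can be checked numerically on a toy problem. The sketch below is not taken from the thesis: it assumes a single particle in one dimension, a finite-difference grid, and inverse iteration for the ground state. All function names and the choice of potentials are illustrative.

```python
import math

def solve_tridiagonal(diag, off, rhs):
    """Thomas algorithm for a symmetric tridiagonal system with constant off-diagonal."""
    n = len(diag)
    c = [0.0] * n  # modified super-diagonal
    d = [0.0] * n  # modified right-hand side
    c[0] = off / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - off * c[i - 1]
        c[i] = off / denom
        d[i] = (rhs[i] - off * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x

def ground_state_density(potential, h, iterations=200):
    """Ground-state density of H = -1/2 d^2/dx^2 + v(x) on a uniform grid (atomic units).

    Inverse iteration with a shift below min(v) converges to the lowest eigenvector.
    """
    shift = min(potential) - 1.0
    diag = [1.0 / h ** 2 + v - shift for v in potential]
    off = -0.5 / h ** 2
    psi = [1.0] * len(potential)
    for _ in range(iterations):
        psi = solve_tridiagonal(diag, off, psi)
        norm = math.sqrt(sum(p * p for p in psi) * h)
        psi = [p / norm for p in psi]
    return [p * p for p in psi]

def max_difference(rho_1, rho_2):
    return max(abs(a - b) for a, b in zip(rho_1, rho_2))

h = 0.1
xs = [-8.0 + h * i for i in range(161)]
v_a = [0.5 * x * x for x in xs]               # harmonic well
v_const = [v + 3.7 for v in v_a]              # same well shifted by a constant
v_other = [0.5 * x * x + 0.5 * x for x in xs] # genuinely different potential

rho_a = ground_state_density(v_a, h)
rho_const = ground_state_density(v_const, h)
rho_other = ground_state_density(v_other, h)

print("constant shift:", max_difference(rho_a, rho_const))
print("different potential:", max_difference(rho_a, rho_other))
```

Under these assumptions the density is unchanged (to numerical precision) when the potential is shifted by a constant, and changes visibly when the potential differs by more than a constant, in line with (A.2.5) and (A.2.10).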

A.2.2 Second Theorem

The second Hohenberg-Kohn theorem specifies that the ground state energy of a system can be obtained through the variation of a trial density[439]. This variational principle is analogous to the conventional Rayleigh-Ritz variational principle, where the ground state energy of a system is obtained through variation of a trial wavefunction,

$$E_0 = \min_{\Psi} \langle \Psi | H | \Psi \rangle . \tag{A.2.12}$$

The minimisation with respect to the trial density is accomplished by the constrained two-step search methodology of Levy and Lieb[440], where the electron density, $\rho(\mathbf{r})$, is kept constant in the first step. We denote a trial wavefunction $\Psi^{\{\alpha\}}$, where $\{\alpha\}$ denotes the set of variational parameters to optimise. We begin by writing the minimisation as

$$\begin{aligned}
E_0[\rho(\mathbf{r})] &= \min_{\{\alpha\}} \langle \Psi^{\{\alpha\}} | H | \Psi^{\{\alpha\}} \rangle \\
&= \min_{\{\alpha\}} \langle \Psi^{\{\alpha\}} | T + U + v(\mathbf{r}) | \Psi^{\{\alpha\}} \rangle \\
&= F[\rho(\mathbf{r})] + \int d\mathbf{r}\, v(\mathbf{r}) \rho(\mathbf{r}) ,
\end{aligned} \tag{A.2.13}$$

where the universal functional, $F[\rho(\mathbf{r})]$, is defined as

$$F[\rho(\mathbf{r})] = \min_{\{\alpha\}} \langle \Psi^{\{\alpha\}} | T + U | \Psi^{\{\alpha\}} \rangle , \tag{A.2.14}$$

and optimising the universal functional with respect to $\{\alpha\}$ is the first step of the constrained optimisation. The second step of the optimisation involves the minimisation with respect to the density, such that

$$E_0[\rho(\mathbf{r})] = \min_{\rho(\mathbf{r})} \left[ F[\rho(\mathbf{r})] + \int d\mathbf{r}\, v(\mathbf{r}) \rho(\mathbf{r}) \right] , \tag{A.2.15}$$

and we arrive at the Hohenberg-Kohn second theorem, which proposes the existence of a variational principle with respect to the density.

A.3 Kohn-Sham Equations

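We consider the universal functional of (A.2.14), $F[\rho(\mathbf{r})]$, where the potential energy operator has been split into two separate components

$$\begin{aligned}
F[\rho(\mathbf{r})] &= T[\rho(\mathbf{r})] + U[\rho(\mathbf{r})] \\
&= T[\rho(\mathbf{r})] + \frac{1}{2} \int d\mathbf{r}' \int d\mathbf{r}\, \frac{\rho(\mathbf{r})\rho(\mathbf{r}')}{\|\mathbf{r} - \mathbf{r}'\|} + E_{xc}[\rho(\mathbf{r})] \\
&= T[\rho(\mathbf{r})] + V_{ee}[\rho(\mathbf{r}), \rho(\mathbf{r}')] + E_{xc}[\rho(\mathbf{r})] ,
\end{aligned} \tag{A.3.1}$$

and we hope to capture the exchange-correlation energetics that are omitted in mean-field approaches (such as Hartree-Fock) in the exchange-correlation energy, $E_{xc}[\rho(\mathbf{r})]$. The total energy of the system is written as

$$E[\rho(\mathbf{r})] = \int d\mathbf{r}\, v(\mathbf{r}) \rho(\mathbf{r}) + T[\rho(\mathbf{r})] + \frac{1}{2} \int d\mathbf{r}' \int d\mathbf{r}\, \frac{\rho(\mathbf{r})\rho(\mathbf{r}')}{\|\mathbf{r} - \mathbf{r}'\|} + E_{xc}[\rho(\mathbf{r})] . \tag{A.3.2}$$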

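The forms of $T[\rho(\mathbf{r})]$ and $E_{xc}[\rho(\mathbf{r})]$, the kinetic and exchange-correlation energies respectively, are in general unknown. $V_{ee}[\rho(\mathbf{r}), \rho(\mathbf{r}')]$ is the familiar electrostatic interaction between two charge densities. Now, by use of the constraint that the number of electrons, $\int d\mathbf{r}\, \rho(\mathbf{r}) = N$, is a conserved quantity, we are able to form a Lagrangian of the total system energy as given in (A.2.15),

$$\mathcal{L} = F[\rho(\mathbf{r})] + \int d\mathbf{r}\, v(\mathbf{r}) \rho(\mathbf{r}) - \lambda \left( \int d\mathbf{r}\, \rho(\mathbf{r}) - N \right) , \tag{A.3.3}$$

where $\lambda$ is a Lagrange multiplier. By the second Hohenberg-Kohn theorem, we are able to take the functional derivative of this quantity to optimise the electron density,

$$\begin{aligned}
\frac{\delta \mathcal{L}}{\delta \rho(\mathbf{r})} &= \frac{\delta}{\delta \rho(\mathbf{r})} \left[ \int d\mathbf{r}\, v(\mathbf{r}) \rho(\mathbf{r}) + F[\rho(\mathbf{r})] - \lambda \left( \int d\mathbf{r}\, \rho(\mathbf{r}) - N \right) \right] \\
&= v(\mathbf{r}) + \frac{\delta T[\rho(\mathbf{r})]}{\delta \rho(\mathbf{r})} + \int d\mathbf{r}'\, \frac{\rho(\mathbf{r}')}{\|\mathbf{r} - \mathbf{r}'\|} + V_{xc}[\rho(\mathbf{r})] - \lambda \\
&= \frac{\delta T[\rho(\mathbf{r})]}{\delta \rho(\mathbf{r})} + V_{KS}[\rho(\mathbf{r})] - \lambda = 0 .
\end{aligned} \tag{A.3.4}$$

In the above, we have introduced a number of shorthand notations. The exchange-correlation potential, $V_{xc}[\rho(\mathbf{r})]$, is defined as

$$V_{xc}[\rho(\mathbf{r})] = \frac{\delta E_{xc}[\rho(\mathbf{r})]}{\delta \rho(\mathbf{r})} , \tag{A.3.5}$$

and the Kohn-Sham potential, $V_{KS}[\rho(\mathbf{r})]$, is given by

$$V_{KS}[\rho(\mathbf{r})] = v(\mathbf{r}) + \int d\mathbf{r}'\, \frac{\rho(\mathbf{r}')}{\|\mathbf{r} - \mathbf{r}'\|} + V_{xc}[\rho(\mathbf{r})] . \tag{A.3.6}$$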

From (A.3.4), we see that the final equation resembles that of a system of non-interacting particles in an external potential, $V_{KS}[\rho(\mathbf{r})]$, and so the ground state electron density is given by solving a series of one-electron Schrödinger equations. Recall that the electron density is given by

$$\rho(\mathbf{r}) = \sum_{i=1}^{N} f_i\, \psi_i^*(\mathbf{r}) \psi_i(\mathbf{r}) , \tag{A.3.7}$$

where $f_i$ is the filling of the $i$-th orbital, $\psi_i(\mathbf{r})$. The kinetic energy functional is then typically written as

$$T[\rho(\mathbf{r})] = -\frac{1}{2} \sum_{i=1}^{N} f_i \int d\mathbf{r}\, \psi_i^*(\mathbf{r}) \nabla^2 \psi_i(\mathbf{r}) . \tag{A.3.8}$$

As such, we obtain the Kohn-Sham equations

$$\left[ -\frac{1}{2} \nabla^2 + V_{KS}(\mathbf{r}) \right] \psi_i(\mathbf{r}) = \epsilon_i \psi_i(\mathbf{r}) , \tag{A.3.9}$$

which form an eigenvalue problem that can be solved by a myriad of techniques.

It has been proved that an exact exchange-correlation functional, $E_{xc}$, exists. However, its form is unknown, and so while DFT is a formally exact theory, the necessity of approximating $E_{xc}$ renders it inexact in practice. A number of strategies exist for approximating $E_{xc}$, which we briefly outline. By invoking the Local Density Approximation (LDA), we approximate $E_{xc}$ as the integral over all space of a function of the local electron density,

$$E_{xc}^{LDA}[\rho(\mathbf{r})] = \int d\mathbf{r}\, f_{xc}[\rho(\mathbf{r})] , \tag{A.3.10}$$

where $f_{xc}$ denotes a general function of the local electron density. One can invoke more accurate, and consequently more time-consuming, functionals by the inclusion of higher order derivatives of the electron density. By invoking the so-called “Generalised Gradient Approximation” (GGA), $E_{xc}$ is written as a functional of both the electron density and its spatial derivatives

$$E_{xc}^{GGA}[\rho(\mathbf{r}), \nabla\rho(\mathbf{r})] , \tag{A.3.11}$$

and can vastly improve upon the accuracy offered by the LDA. As a natural extension to the GGA functionals, the meta-GGA functionals (m-GGA) (such as the popular Minnesota functionals[441]) include second order spatial derivatives of the electron density,

$$E_{xc}^{m\text{-}GGA}[\rho(\mathbf{r}), \nabla\rho(\mathbf{r}), \nabla^2\rho(\mathbf{r})] . \tag{A.3.12}$$
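To make the idea of a local functional such as (A.3.10) concrete, the sketch below integrates two local functions of the density over the hydrogen atom ground-state density: the Thomas-Fermi kinetic term of (A.1.1) and, as an assumed stand-in for $f_{xc}$, the exchange-only Dirac form $-\frac{3}{4}(3/\pi)^{1/3}\rho^{4/3}$. The choice of density, the Dirac exchange form, the radial trapezoidal grid and all names are illustrative assumptions rather than anything prescribed in the text.

```python
import math

C_F = 0.3 * (3.0 * math.pi ** 2) ** (2.0 / 3.0)  # Thomas-Fermi kinetic constant
C_X = 0.75 * (3.0 / math.pi) ** (1.0 / 3.0)      # Dirac exchange constant (assumed f_xc)

def hydrogen_density(r):
    """Ground-state electron density of the hydrogen atom (atomic units)."""
    return math.exp(-2.0 * r) / math.pi

def integrate_local(f, density, r_max=30.0, steps=30000):
    """Trapezoidal estimate of the integral of f(rho(r)) over all space
    for a spherically symmetric density."""
    h = r_max / steps
    total = 0.0
    for k in range(steps + 1):
        r = k * h
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * 4.0 * math.pi * r * r * f(density(r))
    return total * h

n_electrons = integrate_local(lambda rho: rho, hydrogen_density)
t_tf = integrate_local(lambda rho: C_F * rho ** (5.0 / 3.0), hydrogen_density)
e_x_lda = integrate_local(lambda rho: -C_X * rho ** (4.0 / 3.0), hydrogen_density)

print(f"electrons: {n_electrons:.4f}")
print(f"Thomas-Fermi kinetic energy: {t_tf:.4f} hartree")
print(f"Dirac exchange energy: {e_x_lda:.4f} hartree")
```

For this density the Thomas-Fermi kinetic energy comes out near 0.289 hartree, well below the exact value of 0.5 hartree, which illustrates how crude a purely local kinetic energy functional can be.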

One can also generalise each of the functionals to account for electron spins. Various combinations of the above functional types with exact (Hartree-Fock) exchange also exist, and are referred to as hybrid functionals, such as the common B3LYP functional.

Appendix B

Storage of Kriging Models

In Section 2.4.2, we saw that the prediction of the response at a point, $y(\mathbf{x}^*)$, is given by (2.4.11). Thus, the prediction of a response requires a priori knowledge of $\mathbf{R}^{-1}$, $\hat{\boldsymbol{\theta}}$, $\hat{\mathbf{p}}$, $\hat{\mu}$, $\hat{\sigma}^2$, $\mathbf{y}$ and $\mathbf{r}$. Assuming that all of this data is stored in double precision ASCII format, for a training set size of 1000 points, where the inputs are 60-dimensional (e.g. a single amino acid), we would require

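| Term | Number of Elements | Memory Required (bytes) |
|---|---|---|
| $\mathbf{R}^{-1}$ | $n \times n$ | $8n^2$ |
| $\hat{\boldsymbol{\theta}}$ | $d$ | $8d$ |
| $\hat{\mathbf{p}}$ | $d$ | $8d$ |
| $\hat{\mu}$ and $\hat{\sigma}^2$ | 2 | 16 |
| $\mathbf{y}$ | $n$ | $8n$ |
| $\mathbf{r}$ | $n$ | $8n$ |
| Total | | $8n(n + 2) + 16(d + 1)$ |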


Since $n \gg d$ (typically), we find that our storage requirements scale as roughly $O(n^2)$, so that for a system with $n = 1000$, we require approximately 8 MB of storage. As discussed in Section 2.5.2, we require 28 kriging models for each atom type. As such, each atom type will require approximately 28 × 8 = 224 MB of memory. As an (admittedly crude) estimate of the number of atom types required, the first parameterisation of the AMBER force field possessed 56 distinct atom types. Accounting for each of these atom types would require 224 × 56 ≈ 12.5 GB of memory.
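The arithmetic above can be reproduced with a few lines of code. This is a minimal sketch that assumes 8 bytes per stored value, as in the table; the function name and variable names are illustrative.

```python
def kriging_model_bytes(n, d, bytes_per_value=8):
    """Bytes needed to store one kriging model with n training points and d inputs."""
    r_inverse = n * n      # R^-1, an n x n matrix
    theta_and_p = 2 * d    # theta-hat and p-hat, d values each
    mu_and_sigma2 = 2      # mu-hat and sigma-hat squared
    y_and_r = 2 * n        # y and r, n values each
    return bytes_per_value * (r_inverse + theta_and_p + mu_and_sigma2 + y_and_r)

n, d = 1000, 60
per_model = kriging_model_bytes(n, d)
per_atom_type = 28 * per_model   # 28 kriging models per atom type
total = 56 * per_atom_type       # 56 atom types

print(f"per model:     {per_model / 1e6:.1f} MB")
print(f"per atom type: {per_atom_type / 1e6:.0f} MB")
print(f"all types:     {total / 1e9:.1f} GB")
```

The counts agree with the closed form $8n(n+2) + 16(d+1)$ and give roughly 8 MB, 224 MB and 12.6 GB, in line with the rounded figures quoted above.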

Whilst certainly within the realms of what is feasible given modern hardware, having to store 12.5 GB of data in RAM is not an ideal situation. One could sequentially load the kriging models into RAM over the course of an MD simulation as particular atom types are encountered during the energy evaluation stage (which is amenable to massively parallel calculations), but one would then be crippled by I/O, and so we would ideally like to compress the data in the files as much as possible.

We first notice that since $\mathbf{R}$ is symmetric, so is its inverse. As such, we need only store the upper triangular form of $\mathbf{R}^{-1}$, which immediately reduces our storage costs from $8n^2$ to $4n(n+1)$. Taking this into account, we reduce our estimated memory requirements from 12.5 GB to roughly 6.3 GB. Whilst this necessitates a computational overhead of symmetrising $\mathbf{R}^{-1}$ when it is required, this can be minimised with matrix multiplication algorithms for symmetric matrices. A further immediate saving can be made based on kriging being a stochastic process. As such, one can argue that double precision accuracy is not necessary. Use of single precision floats reduces our memory costs by a further factor of two, leading to a requirement of roughly 3.2 GB of storage space.
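A minimal sketch of the packed storage idea follows. It assumes a row-by-row layout of the upper triangle, which is one common convention and not necessarily the layout used in the thesis software; the function names and the small test matrix are illustrative.

```python
def pack_upper(matrix):
    """Flatten the upper triangle (including the diagonal) row by row."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i, n)]

def packed_index(i, j, n):
    """Position of element (i, j) of an n x n symmetric matrix in the packed list."""
    if i > j:              # use symmetry: element (i, j) equals element (j, i)
        i, j = j, i
    return i * n - i * (i - 1) // 2 + (j - i)

def unpack_symmetric(packed, n):
    """Rebuild the full symmetric matrix from its packed upper triangle."""
    return [[packed[packed_index(i, j, n)] for j in range(n)] for i in range(n)]

r_inverse = [[4.0, 1.0, 0.5],
             [1.0, 3.0, 0.2],
             [0.5, 0.2, 2.0]]
packed = pack_upper(r_inverse)
restored = unpack_symmetric(packed, 3)

n = 1000
full_bytes = 8 * n * n
packed_bytes = 8 * (n * (n + 1) // 2)   # equals 4n(n+1)
print(len(packed), restored == r_inverse, full_bytes, packed_bytes)
```

Only $n(n+1)/2$ values are stored, so at 8 bytes per value the cost is $4n(n+1)$ bytes, and any element can be looked up directly from the packed list without rebuilding the full matrix.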

Our next memory saving comes from a prudent selection of training points for each kriging model. In doing so, the number of training points required can be decreased significantly from our earlier estimate of 1000 training points. Speaking somewhat optimistically, we can estimate the number of training examples required to be roughly 600. This reduces our memory cost to roughly 2.0 GB.

So far, we have assumed that all of our data is stored in ASCII format. In RAM, data is stored in binary form, which constitutes a threefold reduction in memory requirements. We can pass on this threefold reduction to our hard disk storage requirements as well by storing all of our data in binary format. In Fortran 90/95, the default write to file is record-based, i.e. for each string of double precision numbers stored, 4 bytes are placed on either side of the string, containing the size of the string.
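The record layout just described can be mimicked with Python's `struct` module. This is only an emulation under stated assumptions: 4-byte record markers and little-endian byte order, which is a common compiler convention rather than something the Fortran standard fixes. For comparison, the same values are also packed with no markers at all; the function names are illustrative.

```python
import struct

def record_bytes(values):
    """Emulate a record-based unformatted write: length marker, payload, length marker."""
    payload = struct.pack(f"<{len(values)}d", *values)  # 8 bytes per double
    marker = struct.pack("<i", len(payload))            # 4-byte record length
    return marker + payload + marker

def raw_bytes(values):
    """Pack the same values back to back with no record markers."""
    return struct.pack(f"<{len(values)}d", *values)

values = [1.0, 2.0, 3.0]
with_markers = record_bytes(values)
without_markers = raw_bytes(values)
print(len(with_markers), len(without_markers))
```

For three double precision values the marked record occupies 24 + 8 = 32 bytes, against 24 bytes for the bare payload, so the relative overhead is largest when many short records are written.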

Stream-based writing is a Fortran 2003 functionality, and removes the record markers of record-based writing. So, for a string of three double precision values, 24 bytes + 8 bytes = 32 bytes of storage are required with record-based writing, in contrast to the 24 bytes required for stream-based I/O. By use of stream-based I/O, we are able to reduce the hard disk cost from 2.0 GB to just over 700 MB. This quantity is certainly feasible given modern computational resources.

Addendum

PCCP PERSPECTIVE

Multipolar electrostatics

Salvatore Cardamone (a, b), Timothy J. Hughes (a, b) and Paul L. A. Popelier* (a, b)

Cite this: Phys. Chem. Chem. Phys., 2014, 16, 10367

Received 26th November 2013, Accepted 5th April 2014

DOI: 10.1039/c3cp54829e

www.rsc.org/pccp

Atomistic simulation of chemical systems is currently limited by the elementary description of electrostatics that atomic point-charges offer. Unfortunately, a model of one point-charge for each atom fails to capture the anisotropic nature of electronic features such as lone pairs or π-systems. Higher order electrostatic terms, such as those offered by a multipole moment expansion, naturally recover these important electronic features. The question remains as to why such a description has not yet been widely adopted by popular molecular mechanics force fields. There are two widely-held misconceptions about the more rigorous formalism of multipolar electrostatics: (1) Accuracy: the implementation of multipole moments, compared to point-charges, offers little to no advantage in terms of an accurate representation of a system’s energetics, structure and dynamics. (2) Efficiency: atomistic simulation using multipole moments is computationally prohibitive compared to simulation using point-charges. Whilst the second of these may have found some basis when computational power was a limiting factor, the first has no theoretical grounding. In the current work, we disprove the two statements above and systematically demonstrate that multipole moments are not discredited by either. We hope that this perspective will help in catalysing the transition to more realistic electrostatic modelling, to be adopted by popular molecular simulation software.

a Manchester Institute of Biotechnology (MIB), 131 Princess Street, Manchester, M1 7DN, UK. E-mail: [email protected]
b School of Chemistry, University of Manchester, Oxford Road, Manchester, M13 9PL, UK

Published on 16 April 2014. Downloaded by University of Cambridge 21/11/2016 20:56:25.

Salvatore Cardamone obtained a 1st class BSc in biochemistry from the University of Sheffield. He has recently moved to the University of Manchester to complete a PhD in theoretical chemistry under the supervision of Prof. Popelier. His research focuses on structural sampling of molecular species and the parameterisation of a novel force field for use with carbohydrates. Other research interests include mathematical formalism of physical systems. Outside of academia, Salvatore is a competitive ballroom and latin dancer.

Tim Hughes graduated with a 1st class BSc (chemistry) from The University of Manchester in 2012. He is currently working towards his PhD in computational chemistry under Prof. Popelier developing the novel “Quantum Chemical Topological Force Field” (QCTFF), with particular interest in the non-covalent interactions between molecules. In his spare time he enjoys playing the drums and engaging in team sports such as football.

This journal is © the Owner Societies 2014. Phys. Chem. Chem. Phys., 2014, 16, 10367–10387.

1. Introduction

Atomistic simulations of large systems over long time scales can only be achieved by using energy potentials, rather than by solving the Schrödinger equation on-the-fly. The question is then how to best represent an atom such that it interacts with other atoms in a realistic manner. A convenient and trustworthy way to answer this question is to start from the electron density, because from the first Hohenberg–Kohn theorem we know that a system’s total energy can be obtained just from its electron density. The original question can then be rephrased as to how one should represent the electron density of a given atom while it is part of a system. Surprisingly, the current and predominant view is to think of an atom in a system as being spherical. This picture corresponds to representing the atomic
electrostatic potential as being generated by an atomic point-charge. This means that a single number (the point-charge) is associated with the atom’s nucleus while assuming that this number summarises the complexity of the atomic electron density sufficiently well in order to predict its electrostatic interaction behaviour.

A simple example shows that this view cannot be right. In Fig. 1 we consider the global energy minimum of the water dimer.1 This case serves to illustrate an essential argument that also applies to hydrogen bonding in general, π–π stacking and halogen bonding, which will be discussed in detail much later. A typical ab initio calculation on the water dimer will produce a “flap angle” α of about 45°, while a point-charge model (i.e. single charge for each atom) will generate an angle of about 25°. Disregarding the irrelevant details of the level of theory used or the exact nature of this point charge-model,2 it is clear that the latter cannot predict the required tilt in the water molecule at the right hand side. Only when extra off-nuclear point-charges are added to oxygen does the geometry prediction improve. Equally, if point multipole moments are added to the oxygen then the prediction improves substantially. This example shows that the currently ubiquitous treatment of electrostatic interaction cannot be correct. This simple case study is relevant because it is generally acknowledged that medium-strength hydrogen bonds can be properly described by the electrostatic interaction.

Fig. 1 A schematic geometry of the global minimum of the water dimer obtained by a point-charge model (α ≈ 25°, orange) and an ab initio calculation (α ≈ 45°, green).

In summary, a point-charge is spherically symmetric (or isotropic) and therefore it has no directional preference while interacting with another point-charge. However, a point multipole moment on a nuclear site prefers another point multipole to be oriented in a certain way, in order to lower the multipole–multipole interaction energy (for examples see Fig. 3.2, 3.3 and 3.4 in ref. 3). This anisotropy makes a multipole moment directional.

This perspective brings together contributions that highlight the shortcomings of point-charges. It will argue, based on clear and consistent evidence, that the point-charge model is inherently limited in terms of accuracy, provided one introduces only one charge per atom. If more off-nuclear charges are introduced then the accuracy improves but this perspective focuses on a mathematically more elegant solution, which is that of multipole moments. As this perspective delivers the evidence for the superiority of multipole moments over point-charges it aspires that the status quo of the use of point-charges will change.

We can ask, however, if the inadequacy of current point-charge force fields actually matters over very long time scales, when energy errors can perhaps cancel each other. Might these errors become irrelevant fluctuations drowning in the large scale (space and time) phenomena the molecular simulator is interested in? Apparently not, if one reads a very recent statement published4 in the Conclusions of 100 μs molecular dynamics simulations on 24 proteins. For most of the 24 proteins studied, the simulations drifted away from their native structure (initiated from homology models). The authors stated that “In our view, it is probably more beneficial in the long run to focus on the development of better force fields than on the development of sophisticated methodologies for scoring structures realized in simulation”.

Paul Popelier was educated in Flanders up to PhD level and is a full Professor of Chemical Computation. He has published more than 170 contributions, containing 3 books, a commercially released computer program, and 25 single-author items. He is an EPSRC Established Career Fellow and a Fellow of the Royal Society of Chemistry. Currently his group mainly develops a novel force field based on Quantum Chemical Topology (QCT). In his spare time he plays modern jazz piano and composes.

Multipole moments have been introduced decades ago in the field of atomistic energy potentials but they are still not part of the mainstream theoretical treatment of electrostatic interaction, its applications nor concomitant software. Yet, multipole moments arise naturally and rigorously in the treatment of interactions governed by an inverse distance (or 1/r) dependence. In the following we focus on the essence of what a multipole expansion achieves while omitting mathematical details that can be found elsewhere.3,5,6 Fig. 2 schematically shows two interacting charge distributions (left and right). To simplify matters we put the origin inside the left charge distribution and set the origin of the right distribution at R. The position vector r describes the left distribution by sweeping its volume, while the vector r′ does the same for the right distribution while being based at R.

One can think of an infinitesimal bit of electronic charge density, located at r, interacting with an infinitesimal bit of charge density located at r′ + R. These two interacting charge bits are separated by a distance |r − (r′ + R)|. If one wants to know the total interaction between the two charge densities then one needs to sum over all the possible pairs of interacting
infinitesimal charged density. This full summation is in fact a six-dimensional (6D) integral running over the three-dimensional coordinate space of the left charge distribution and that of the right distribution. The use of so-called addition theorems3 enables the expression 1/|r − (r′ + R)| to be factorised into factors that depend on a single variable only, that is r, r′ or R. This factorisation leads to multipole moments. They can be pre-computed, that is, calculated separately for the electron distribution described by r, and separately for that described by r′. It is important to realise that this pre-computation of the multipole moments is done independently of the geometry of their interaction. This explains the enormous advantage of this pre-computation because the 6D Coulomb integral does not have to be calculated anymore. Instead, it can be replaced by two 3D integrals, each yielding the numerical values of the respective multipole moments. However, the price paid for this huge advantage is possible divergence of the multipolar series expansion. In any event, multipole moments describe the full complexity of the charge distribution at hand, whether it is molecular, atomic, ionic, covalent or metallic. With increasing multipolar rank l the multipole moments “pick up” an increasing number of features of this charge distribution. The more complicated the deviation from isotropy of the charge distribution the more multipole moments are necessary, exactly in order to capture this anisotropy, which point-charges miss.

Fig. 2 A schematic representation of the Coulomb interaction between two charge distributions (left and right). Each distribution is described by a vector (r or r′ + R) that sweeps the volume of the respective distribution. The vector r marks the position of an infinitesimal charge density that interacts with another infinitesimal charge density at position r′ + R. The two infinitesimal charge distributions are separated by a distance |r − (r′ + R)|. The inverse of this distance can be separated into factors depending only on r, on r′ or on R.

The atomic charge, which is intrinsically isotropic, corresponds to the zeroth moment or the monopole moment (l = 0). It should be clear from the discussion above that the atomic charge offers only the simplest of descriptions of two interacting electron distributions. Higher rank moments (i.e. dipole moment or l = 1, quadrupole moment or l = 2, etc.) successively add more detail in their description of the electron distribution. The various terms of the electrostatic energy (appearing in the multipolar expansion) can be bundled by the interaction rank L. This rank is defined as the sum of the ranks of the interacting multipole moments on site A and B incremented by one, or L = lA + lB + 1. This interaction rank is the inverse power appearing in the expression R^(−L), in which R is the distance between the respective sites at which the interacting multipole moments are centred. The lowest possible rank of L is 1, which corresponds to the interaction between point-charges, which is the longest possible range of electrostatic interaction. It should not come as a surprise that truncating the multipolar series already at the very first term (L = 1) harms the proper description of the intricacy of a given electron distribution.

In summary, many mathematical details and technical issues have been omitted in order to focus on the main points. Interested readers can find them in a recent review.7 However, this discussion has set the scene and simply introduced a few important concepts that will recur. We should point out that older literature has been omitted in order to make the perspective more timely or because a more recent case study makes the same point as an older one. For example, a paper by Ritchie and Copenhaver8 published in 1995 compared the electrostatic potential generated by an atom-centered multipole expansion (up to l = 3) with that generated by potential-derived charges surrounding some natural and synthetic nucleic acid bases. The multipolar electrostatics always improved the rms error, by at least 10% to 30%, resulting in differences as large as 15 kJ mol⁻¹. Such conclusions are reminiscent of later work9 or much more recent work10 by Slipchenko, Krylov, Gordon and co-workers, as discussed in Section 2.1.4.

This perspective is organised as follows. The discussion starts, in Section 2, by addressing the increased accuracy of atomic multipole moments over point-charges when modelling the electrostatic interactions between molecules. The application of multipolar electrostatics to both polar systems (water, hydrogen bonding, halogen bonding, biomolecules and solvation) and non-polar systems is addressed. Section 2 also focuses on crystal structure prediction of organic molecules in a separate subsection. Subsequently, Section 3 addresses the efficiency of the implementation of atomic multipole moments in the context of both molecular simulation and in the transferability of multipole moments. Finally, Section 4 briefly discusses currently used multipolar methodologies, in particular AMOEBA, SIBFA and NEMO. A rather poignant conclusion briefly summarises the current state-of-affairs.

2. Accuracy

2.1 Polar systems and intermolecular interactions

2.1.1 Water. Early electrostatic potentials for water consisted of atomic point-charges fitted to reproduce the bulk properties of liquid water. Examples include the simple point-charge (SPC) model and the TIP3P potential. These potentials are still used11 today in spite of both suffering from the same known pitfalls, such as failing to reproduce accurately the experimentally observed radial distribution function (RDF) for O···O (i.e. gOO(r)), or to give a reliable dependence of liquid density on temperature. Attempts at improving the description of water involve additional charge sites, intended to represent the oxygen lone pairs. The TIP4P and TIP5P potentials2,12 and the ST2 potential13 are of this type. Despite an improved representation of the dielectric constant of bulk water and gOO(r) over TIP4P and TIP3P, TIP5P still
poorly reproduces properties such as the heat capacity and the density versus temperature profile.

The anisotropic site potential for water (ASP-W)14 uses an atom-centred Distributed Multipole Analysis (DMA)15 expansion, with multipole moments up to quadrupole on oxygen and dipole on the hydrogens, computed at MP2 level. When ASP-W was compared with the point-charge potentials CKL16 and NCC,17 and the multipolar potential PE,18 only PE provided a comparably accurate minimum energy geometry for the water dimer. The ASP-W potential has been further improved to ASP-W2 and ASP-W4.19,20 The atomic multipolar expansion for ASP-W4 is now truncated at the hexadecapole level, and all interaction terms are included up to rank L = 5. The ASP-W2/4 potentials give a more detailed description of the potential energy surface (PES) of the water dimer than many other water potentials. This potential was also used by Saykally and co-workers21 in the interpretation of their terahertz laser vibration-rotation-tunneling spectra and mid-IR laser spectra22 of water clusters from the dimer to the hexamer. Over a temperature range of 373–973 K, ASP-W2/4 gave values for the second virial coefficient, B(T), close to the experimental values, which is an improvement over TIPnP (n = 3, 4 or 5) point-charge models.

More recently, a novel non-polarisable, multipolar water potential was published,23 with atomic multipole moments up to hexadecapole moment on all atoms (here called "QCTwater"). Multipole moments of so-called topological atoms were introduced, defined by Quantum Chemical Topology (QCT).24–26 QCT is a generalisation of the Quantum Theory of Atoms in Molecules,27 which defines atoms as natural subspaces in the electron density using the minimal concept of the gradient path.28 Molecular dynamics simulations were run on 216 water molecules under periodic boundary conditions using QCTwater in order to test the reproduction of bulk thermodynamic and structural properties. QCTwater predicted the maximum density to be at 6 °C, in good agreement with the experimental value of 4 °C. Monte Carlo simulations using TIP3P and SPC did not reproduce a maximum density at all (within [−50 °C, 100 °C]). TIP4P and SPC/E predicted maximum densities at −15 °C and −38 °C, respectively. At a temperature of 300 K and pressure of 1 atm, QCTwater recorded a density of 996 kg m⁻³, only 0.5 kg m⁻³ below the experimental value. Upon increasing the pressure, the experimentally observed increase in oxygen coordination number from 5 to ~7.5 was also reproduced by QCTwater.

Published on 16 April 2014. Downloaded by University of Cambridge 21/11/2016 20:56:25.

QCTwater also outperformed29 TIP5P when predicting bulk thermodynamic properties such as the diffusion coefficient, thermal expansion coefficient and the isobaric heat capacity of liquid water. QCTwater was also able to reproduce both the experimental OO RDF and the plot of the experimental diffusion coefficient versus temperature to high accuracy. Due to the inclusion of atomic multipole moments, QCTwater produced a more organised, directional hydrogen-bonded network in the first and second hydration shell compared to TIP4P and SPC.

Because they have been parameterised for the reproduction of the bulk properties of liquid water, most point-charge potentials poorly describe ice surfaces and small clusters. The 'induction model' for water30 models each water molecule by a centre-of-mass multipolar expansion. A comparison to ab initio calculations of the electric field inside a vacancy in ice showed that 70% of the electric field is dipolar and that a hexadecapole was needed. TIP4P and ASP-W4 were also used to model the behaviour of water adsorbed onto a NaCl surface.31 The experimental adsorption isotherm for water on NaCl showed four distinct regions: a low coverage region, a transition region, a high coverage region and a presolution region.32 Monte Carlo simulations of the low coverage and high coverage regions were performed using both water potentials. At high coverage, only ASP-W4 predicted a more ordered structure, with three distinct layers of water due to interactions between water molecules and the Na+ and Cl− ions, while TIP4P did not reproduce this layering.

2.1.2 Hydrogen bonding. Hydrogen bond interactions are not only strong, but are also observed to be highly directional. In many cases this directionality is due to anisotropic features in the electron density, most often the lone pairs of the acceptor atom.33–35 Isotropic atomic point-charges are unable to accurately reproduce experimental bonding geometries for a range of molecules.36–42 The Buckingham–Fowler model,43 which combines DMA's multipolar electrostatics with a simple hard-sphere repulsive potential, provides several examples44 where point-charges fail, either by leading to a spurious energy minimum, or by giving quite misleading electrostatic energies. The multipolar electrostatics of this model also successfully predicted the geometries of a great variety of van der Waals complexes.45

Efforts to model the directionality of hydrogen bonding within a point-charge framework either: (i) apply additional functions only to hydrogen bonding atoms, or (ii) add partial charges, typically at the positions of lone pairs. Allinger and Lii46,47 implemented a directionality term into the hydrogen bonding potential of the MM3 force field, improving agreement with the ab initio MP2/6-31G** values. Kollman et al. developed48 a methodology for deriving additional lone pair point-charges for use within a revised version of the AMBER force field. The new potentials showed that the additional sites reproduced much of the directionality observed in MP2 calculations. The additional point-charges also led to improved molecular dipole moments, in turn leading to more accurate thermodynamic properties upon molecular simulation.

Kong and Yan49 showed that multipole moments correctly describe both the directionality and strength of hydrogen bonding for many systems. A minimum interaction rank of L = 3 was required to reproduce the bent structures of the dimers of the hydrides of N, O, F, S and Cl. 'Bending' forces arising from dipolar and quadrupolar interactions played a key role in determining intermolecular bond angles. Similar results were found by Shaik et al.50 where at least L = 5 was needed to reproduce the optimised ab initio structures for water clusters and the hydrated amino acids serine and tyrosine. Again, in models where only point-charges were included (i.e. L = 1), pseudo-planar ring geometries were predicted that ended up too "flat", i.e. the hydrogen atoms did not stick out enough of the (approximate) plane formed by the oxygen nuclei. However, as the number of water molecules in the cluster increased,


models including only lower order moments made better predictions than for smaller clusters. This is due to two effects: (i) for larger clusters there is an increase in the number of long-range interactions, which are well described by low rank terms, and (ii) water molecules in larger clusters are locked into more rigid hydrogen-bonded networks. This conclusion agrees with the observed success of many point-charge potentials capable of describing 'bulk' properties, despite their inability to provide reliable results when implicit water molecules are present.

Ponder et al.51 also came across the superiority of multipolar electrostatics in their work on hydrogen bonding. They calculated the hydrogen bond association energy of the formaldehyde–water complex as a function of the O–H···O=C angle, using their own multipolar force field AMOEBA,52 the point-charge force field OPLS-AA, and MP2/aug-cc-pVTZ. OPLS-AA is incapable of reproducing the energy minima at ~100° and ~260°, while AMOEBA showed a similar shape to the MP2 curve.

In comparisons such as the one above, one should keep in mind the concept of penetration energy. At short range, even when the multipole expansion still converges and is taken to infinite order, the multipolar energy is in error by an amount called the penetration energy.3,53 For a typical hydrogen bond of 20 kJ mol⁻¹, the penetration energy is about 8 kJ mol⁻¹, which amounts to about 40% of the bond energy. As a result, improvements in the multipole expansion are of limited value without simultaneous improvements in the penetration energy. A simple analytic calculation of the electrostatic interaction between a proton and a hydrogen-like atom of nuclear charge Z shows that the electrostatic potential V(r) in a point at a distance r from the origin is not $-1/r$. Instead one finds that $V(r) = -1/r + \exp(-2Zr)\,(Z + 1/r)$. After trivial rearrangement one can write $V(r) = -(1/r)\,[1 - \exp(-2Zr)\,(rZ + 1)]$, where the latter correction factor is called a damping function. This function becomes unity at long range and tends to zero at short range. Damping functions54 specifically for the electrostatic interaction appeared as late as 2000. The origin of the penetration energy is the fact that the proton probe, at whose position the electrostatic potential is evaluated, lies within the electronic charge cloud that generates the potential.55 It should be emphasised that topological atoms (see QCT) do not need a correction for penetration energy because their finite volume makes it possible for a given point to lie completely outside the (topological) atom (that generates the electrostatic potential).

Secondly, the ultimate reliability of a force field is only as high as its overall balance of energy contributions. In other words, the quality of the multipolar electrostatics needs to be matched by a high-quality representation of the non-electrostatic terms, as well as the treatment of polarisation. The latter receives much attention in this article but this should not give the false impression that the other terms are not important. This high exposure to polarisation is because the main topic of this article is the electrostatic treatment in force fields and polarisation is tightly intertwined with it. In summary, one should recognise that a force field using multipole moments may be successful more because of better parameterisation of the exchange-repulsion, for example, than because of the multipole moments themselves. Indeed, van der Waals and exchange-repulsion energies can introduce errors of a similar size to the penetration energy.

Inspired by earlier multipolar simulations56 on liquid HF, Shaik et al.57 ran simulations on liquid imidazole (a heterocyclic aromatic ring) where the electrostatics are described by atomic multipole moments up to hexadecapole. Compared to both OPLS-AA and AMBER simulations, QCT predicted a greater quantity of hydrogen-bonded imidazoles and a lower quantity of stacked imidazoles. This is a consequence of higher order multipolar electrostatics reproducing the directionality of the hydrogen bond, organising the molecules to form a more hydrogen-bonded network. QCT showed strong agreement with the experimental densities, whereas AMBER predicted densities consistently much lower than experiment.

The same authors also performed58 simulations at room temperature and pressure for aqueous imidazole solutions at different concentrations from 0.5 M to 8.2 M. The density of the solutions in QCT simulations depended on concentration, in very good agreement with experiment up to 5 M, after which QCT started underestimating experiment. The AMBER potential consistently underestimated the solution's density for all concentrations by almost 0.02 g cm⁻³. The QCT system recovered the diffusion coefficient for pure water. In contrast, AMBER predicted a significantly overestimated diffusion coefficient for pure water. The two potentials generated notably different local environments, as seen by RDFs and spatial distribution functions (SDFs).

In 2014, the same group published59 a dual study on the hydration of serine: (i) at static level, i.e. by geometry optimisation via energy minimisation of a microhydrated cluster of serine, and (ii) at dynamic level, i.e. by molecular dynamics simulation and RDF/SDF. At static level, multipolar electrostatics best reproduces the ab initio reference geometry. At dynamic level, multipolar electrostatics produces more structure than point charge electrostatics does, over the whole range. The SDF shows that only multipolar electrostatics shows pronounced structure at long range. Even at short range there are many regions where waters appear in the system governed by multipolar electrostatics but not in that governed by point charges.

Fig. 3 Comparison of QCT (green) and AMBER (red) in terms of Spatial Distribution Functions (SDFs) of NH···O (isovalue = 2.0 (left) and 3.0 (right)). The carbon atoms are shaded in light blue. [Source: J. Phys. Chem. B, 2011, 115, 11389.]

Fig. 3 shows the distribution of water molecules58 in an aqueous imidazole solution from the point of view of the nitrogen atom in imidazole to which a hydrogen is bonded. This atom is referred to as NH and each coloured dot (red or green) represents the position of a water's oxygen atom. This NH···O


SDF shows that the distribution of oxygen atoms adjacent to NH (AMBER, red) is asymmetrical at lower isovalues, such as 2.0 (Fig. 3, left panel). At the higher isovalue of 3.0 (right panel), the distribution becomes more circular and its centre coincides with the N–H bond axis. In contrast, the distribution of neighbouring oxygen atoms in the QCT simulations (green) is always symmetrical and centred on the N–H bond axis. This case study is a clear example of the qualitative difference in predictions made on solute–solvent structure by point charges versus multipole moments. Based on a dual RDF and SDF analysis (beyond Fig. 3) this work58 also revealed pronounced differences in the number and ratio of stacked versus hydrogen-bonded imidazole dimers in water.

A "weak hydrogen bond"60 is one where the donor atom is not a strongly electronegative atom. Typical examples include C–H···N/O61 or C–H···π.62,63 These interactions can be of significance for the chiral recognition of a substrate by proteins and also for stabilising the conformations adopted by important biomolecules.64 Simulations utilising classical point-charge force fields do pick up on such interactions to some extent. However, the work of Westhof et al. showed that the cutoff distance for electrostatic interactions must be large in order for weak hydrogen bonds to be observed.65 DMA quadrupole and octopole moments are necessary to find the full range of observed structures of aromatic heterocycles interacting with water compared to when only monopole and dipole moments were used.66 Obviously, the widely used point-charge models such as AMBER, CHARMM and OPLS are currently unable to account for such interactions.

2.1.3 Halogen bonding. There is a growing literature describing what has been termed the 'halogen bond', where the halogen atom acts as an electrophile and interacts with a nucleophilic partner in a linear fashion. These linear halogen bonds can be both as strong as hydrogen bonding, ranging from ~4 to 160 kJ mol⁻¹, and as directional. Because of this directionality halogen bonds can also influence the structure of a system in a similar fashion to hydrogen bonds. It may therefore be assumed (correctly) that atomic point charges will be insufficient to reproduce halogen bonding. The linear pattern of bonding was first reported by Ramasubbu et al.67 in 1986, who inspected the adopted crystal structures of halogen atoms within the Cambridge Crystallographic Database. Since its discovery, the halogen bond has been the subject of many studies on its origin and nature.68–71

Torii and Yoshida showed that the quadrupole moment Θzz of halogen atoms, where the z-axis is defined as the direction of the C–X bond, describes a positive region on the surface of the halogen atom "on the opposite side" (or at 180° on the z-axis, where 0° is on the C atom). This region is commonly referred to as the σ-hole, and its position accounts for the observed linear bonding to nucleophiles.69 Halogen bonding was proven to be dictated primarily by electrostatic effects through the work of Tsuzuki et al.68 who studied C6F6X and C6H6X each interacting with pyridine.

Due to the observed anisotropy in the electronic distribution, point-charges fail to correctly model halogen bonding. In an attempt to introduce halogen bonding into the molecular mechanics (MM) force field AMBER, an extra-point (EP) of positive charge was added to the halogen atoms of 27 halogen containing molecules,72 to mimic the position of the σ-hole. The MM interaction energies of complexes of halogens with Lewis bases had an rms error of only 1.3 kcal mol⁻¹ relative to the MP2 energies. The inclusion of the EP charge sites also improved the molecular dipole moment for a range of halogenated molecules. In a medicinal chemistry application of the EP model, a simulation was carried out on 4,5,6,7-tetrachloro-, bromo-, and iodobenzotriazoles in the active site of the enzyme phospho-CDK2/cyclin. The distributions of the halogen bond angles were in good agreement with the known order of strengths of the different halogens in their bonding. When the standard AMBER potentials were used without the EP charge sites, no halogen bonding was observed, with the X···O distances much larger than in the X-ray structures. Compared to EP, a multipolar force field avoids such ad hoc extensions altogether. Until such force fields were readily available, QM/MM calculations were suggested as an alternative to force fields.73 According to very recent work74 an approach to describe the geometries by electrostatics alone, without allowing for the anisotropy of the exchange repulsion, is likely to be unsuccessful.

2.1.4 Solvation. AMOEBA has been designed to overcome the incapability of AMBER (e.g. ref. 75) and CHARMM of dealing with polarisation, especially that of solvated ions, which create large local electric fields. Each atom in AMOEBA is represented by a permanent partial charge, dipole moment and quadrupole moment, and many-body terms such as polarisation are handled explicitly through a self-consistent dipole polarisation procedure. The AMOEBA force field has been applied to investigate the solvation of many ions in water,76–78 including Cl−, Na+, K+, Mg2+ and Ca2+. Grossfield et al.77 showed that despite the AMOEBA parameters being derived from calculations of gas-phase molecules, inclusion of polarisation terms allows both accurate and transferable single-ion solvation free energies and also solvation free energies of whole salts in both water and formamide. The whole-salt free energies of solvation varied from experimental results by only 0.6 kcal mol⁻¹ on average, whereas the OPLS-AA and CHARMM27 force fields deviated from experiment by 9.8 and 6.6 kcal mol⁻¹, respectively. In the RDF of solvent molecules around the K+ and Cl− the non-polarisable force fields show overstructuring, a consequence of fixed point-charges, favouring only a limited range of geometries.

2.2 Non-polar systems

The ability to replicate π-interactions rests on toroidal electronic features above and below the electron-poor plane in ring systems. Spherical electrostatic potentials emanating from atomic centres79 do not account for this type of system. The electrostatic properties of saturated hydrocarbons were modelled80 by a point-charge, +p, placed on hydrogen sites, and an opposing charge of −2p centred on the carbon. Whilst this assignment allowed for the reasonably accurate prediction of hydrocarbon crystal structures, the model failed for aromatic systems, even qualitatively. Price81 demonstrated that the use of DMA convincingly exposed deficiencies in this "separated point-charge" model.

A study82 on aromatic stacking proposed that π-stacking arises from an interaction between the electron-rich toroids out


of the aromatic plane with the electron-poor σ-backbone of another aromatic species. As such, a point-charge of −p above and below the aromatic plane, in addition to a compensatory +2p point-charge on each carbon atom in the plane, accounts for these electronic features. This model may recover the preferred parallel-displaced conformation of two aromatic molecules. Given a system of two aromatic complexes, for example C6H6···C6H5X, one may postulate relative interaction energies based upon the identity of X. An electronegative group seizes electronic population from the π-system in C6H5X. This effect results in a decreased electrostatic repulsion between the two interacting π-systems, and enhances the net electrostatic interaction. An electropositive group, on the other hand, would contribute towards the π-system, thus inducing a net electrostatic repulsion by the opposing mechanism. Such a simplistic model has been criticised by several groups,83 claiming that an enhanced interaction is observed relative to the benzene dimer, regardless of the identity of X. It was, however, confirmed that electron-withdrawing groups enhanced the interaction more than those which were electron-donating.

The subtle role of electrostatics in such small non-polar systems implies that their modelling requires an equally subtle description of underlying electronic properties, where dispersion is also shown to be a key factor.84 Such distinct electrostatic features may be captured by the implementation of multipole moments, shown in work85 where three benzenoids with large negative, neutral and large positive quadrupole moments were complexed with a small molecule (HF, H2O, NH3 and CH4) and geometry optimised. The multipole moments clearly governed the energetically favoured geometries of the various complexes.

Past work86 demonstrated that a single central multipole moment expansion diverges with increasing expansion rank. However, a distribution of the multipole moments over the atoms, such as in DMA, overcomes this problem. Such a method recovers correct orientations and electrostatic interaction energies. In a different study, electrostatic minima for several van der Waals complexes were located87 by a point-charge model and a full DMA up to hexadecapole moment. A notable example in this work utilises the Buckingham–Fowler model to predict five minimum energy conformations of the benzene dimer. The point-charge model predicts the global minimum to be the parallel dimer. In contrast, the DMA model yields a conformation that complies with the ab initio level calculation, whereby the most favourable conformation is that of the parallel-displaced dimer. In addition to this, relative energies between the five minima were found to be poorly represented by the point-charge model compared to full DMA.

This inability of point-charge electrostatics to reproduce ab initio derived conformations of benzene dimers has been reiterated in the work of Koch and Egert.88 To demonstrate that the inclusion of anisotropic electrostatic features is imperative not only in small, isolated systems, they considered an additional, supramolecular system of a benzene molecule situated within the cavity of a hexa-oxacyclophane host. Here, a T-shaped complex is formed between the benzene and hydroquinone fragments of the cyclophane. A point-charge energy minimisation resulted in a structure where hydroquinone fragments formed parallel-displaced configurations with the benzene molecule. In contrast, the usage of multipole moments recovered a T-shaped conformation.

Such aromatic complexes are dominant in biological systems, forming essential stabilising elements in nucleic acids,89 proteins90 or carbohydrate–protein interactions in immune complexes,91 to name a few. Biological processes such as molecular recognition and catalysis are frequently stabilised by interactions between the π-density of aromatic systems. Point-charges provide a poor description of the electronic distribution of aromatic systems, and the XED force field aimed79 at capturing the anisotropy by the addition of extra point-charge sites. It was able to correctly predict the edge-to-face stacking for a range of substituted polyphenyl species, whereas AMBER, OPLS, MM2 and MM3 were not.

The work of Hill et al.92 also demonstrates the important role of electrostatics in stabilising aromatic stacking interactions due to a degree of cancelling of the attractive correlation dispersion term by exchange repulsion and delocalisation effects. Gordon and co-workers10 used their own 'Effective Fragment Potential' (EFP) method to investigate the interactions between nucleic acid bases. The EFP method is described as a low cost alternative to ab initio calculations, and can be considered as a polarisable multipolar force field without empirically fitted parameters. A DMA was performed on atomic centres and bond midpoints up to octopole moment. The EFP method accurately reproduced the interaction energies between stacked dimers A···A and T···T, with deviations from MP2 energies within 1.5 and 3.5 kcal mol⁻¹, respectively.

Tafipolsky and Engels implemented an extension to AMOEBA showing a much improved description of stacked aromatic systems,93 including atomic multipole moments up to hexadecapole, with reparameterised dipolar polarisabilities, and a specific short-range charge penetration term. When compared against AMOEBA, MM3 and OPLS-AA, the new model showed values for the energies of both the stacked and T-shape dimers of benzene closer to accurate symmetry adapted perturbation theory (SAPT)94 values.

Marshall et al.95 ran simulations on β-hairpin structures of model polypeptides involving cation–π interactions between cationic (Me)n-Lys+ residues and two aromatic tryptophan side chains (n = 0, 1, 2, 3). Simulations were run using the multipolar polarisable force field AMOEBA, and OPLS-AA, CHARMM and AMBER. Only AMOEBA reproduced the experimental NOE distances between the lysine and tryptophan with any consistency, accurately predicting over 80% of the observed NOEs across the four systems (Fig. 4). The point-charge force fields only predicted 40–50% of the observed NOEs in two simulations, and performed worse still for the remaining ten simulations with a prediction success rate of ~10%.

2.3 Crystal structure prediction

To accurately predict the structure into which a molecule will crystallise, a computational model must provide a rigorous description of both bonded and non-bonded terms, as well as sampling the entirety of the PES. We restrict the discussion to


how a multipolar description of the non-bonded electrostatic term can improve prediction accuracy relative to point-charges. Factors affecting other contributions are discussed in detail elsewhere.96

Fig. 4 Summary of the percentage of experimentally observed NOEs for the four model β-hairpin peptides (lysine residues) predicted by 100 ns MD simulation in explicit solvent by the four force fields compared. [Source: J. Am. Chem. Soc., 2012, 134, 15970.]

Fig. 5 The structures used in the first four blind tests on crystal structure prediction (CSP). [Source: Int. Rev. Phys. Chem., 2008, 27, 541.]

Typical work assumes that a given molecule will adopt a crystal structure with the lowest possible lattice energy. The corresponding ranking criterion was used by the Cambridge Crystallographic Data Centre (CCDC), who encouraged groups to participate in a series of five blind tests.97–100 These tests were organised as competitions in which participants were invited to predict a range of unknown crystal structures as seen in Fig. 5 and 6. In each competition, the participating groups used a range of computational methods, including point-charge, multipolar and statistical approaches. A summary of the results of the five blind tests can be seen in Table 1. At first glance, the results of the early tests called CSP1999, CSP2001 and CSP2004 suggested that methods with a multipolar description of the electrostatics provided no greater reliability for predicting the correct crystal structure relative to point-charge models. For example, the point-charge electrostatics of Verwer and Leusen's MSI-PP101,102 method outperformed the multipolar computer program DMAREL103 method of Price et al. in the CSP1999 test. Post-competition analysis revealed that the searching algorithm was to blame rather than the multipolar force field. This conclusion turned out to be the recurring message across all three early tests. The test set of small, rigid molecules containing only C, H, N and O was generally predicted correctly (with multipolar electrostatics providing a slight advantage over point-charges). However, molecules with a high degree of conformational flexibility were not being sampled thoroughly and as a result, the experimental


structures were not identified. The results of the fourth blind test,100 CSP2007, showed that with the implementation of improved searching algorithms, the multipolar electrostatic method of Price et al.100 consistently outperformed methods with point-charge electrostatics.

Fig. 6 The six structures used in the CSP2010 blind test. [Source: Acta Crystallogr., Sect. B: Struct. Sci., 2011, 67, 535.]

Following the success of the CSP2007, a fifth blind test104 was organised named CSP2010. The test molecules used in this study can be seen in Fig. 6. The improved results of CSP2007 led to the introduction of two new categories of molecule: a larger, highly flexible molecule and a hydrate with multiple polymorphs (four polymorphs tested for prediction), leading to a total of nine crystal structures to be tested. Three participating groups (Day,105–107 van Eijck,108 and Price et al.105,109) used atomic multipole moments to describe the electrostatic contribution to the crystal lattice energy. Multipole moment methods clearly outperformed the point charge methods (see Table 1), with multipolar methods correctly predicting four of the nine structures, compared to only one correct prediction by the point charge methods. It is interesting that for molecule XIX van Eijck switched to point charges rather than multipole moments and was able to predict the correct structure. This highlights the importance of factors other than the electrostatic description when predicting crystal structures. The most consistent method was GRACE of Neumann et al.,110,111 which used plane-wave DFT to calculate the electrostatic energy, which one would expect to outperform even multipolar methods.

Day et al. compared two electrostatic schemes, one an atomic point-charge scheme and the other including multipole moments, for their ability to predict the 64 experimentally observed crystal structures of 50 organic molecules. The multipolar scheme reproduced 44 of the experimental structures to be within the top five most stable crystal structures for each given molecule, whereas the point-charge scheme was able to find only 36 structures. Multipolar electrostatics also correctly predicted 32 of the compounds to have structures within 0.5 kJ mol⁻¹ compared to only 23 predicted by point-charges. In a response to the poor results of the CSP1999 blind test, Mooij and Leusen combined multipole moments with the Dreiding force field, and compared the predictive capabilities of the new model to point-charges.112 Multipole moments were able to correctly predict three out of the five experimental crystal structures as the most stable crystal polymorph, compared to only one by point-charges.

Day et al. observed113 that for 50 organic molecules with many polymorphic crystal structures, lattice energy minimisation using atomic point-charges was considerably less accurate for molecules with hydrogen bond donor–acceptor groups than for those without. The point-charge descriptions within the FIT,114,115 W99,116–118 DREIDING,119 CVFF95120–122 and COMPASS123 force fields used were described as being too simplistic to describe strong, highly directional bonds that guide crystal formation. The presence of strong hydrogen bonding leads to higher energy barriers between different minima on the PES, and acts to trap crystals in the local "metastable" states. An atomic point-charge description flattens these barriers, resulting in structures moving to lower energy minima during relaxation stages in the lattice energy calculation. For example, point-charges were unable to predict the experimental "stepped sheet" structure of 2-amino-3-nitropyrimidine due to the crystal relaxing into the energy well of another polymorph.

Sometimes multipole moments do not appear to offer any clear advantage over point-charges although, generally, it is found that factors other than the electrostatic potential are responsible for the observed inferiority of multipole moments. A novel electrostatic potential built for the MM3 force field was tested on the crystal structures of oligothiophenes124 and atomic point-charges outperformed multipole moments for all but one case, namely that of α-perfluorosexithiophene (PFT4). The crystal structure for PFT4 was the structure most influenced by electrostatic interactions, an instance where one should not be surprised that multipolar electrostatics were superior. Brodersen et al. compared five electrostatic models including ESP derived point-charges and tested multipole moments in the prediction of 48 crystal structures, again using the DREIDING force field.125 Due to

This journal is © the Owner Societies 2014 Phys. Chem. Chem. Phys., 2014, 16,10367--10387| 10375 View Article Online

Perspective PCCP

strong dependence on intramolecular terms in the force field, such as angle bends, bond stretches and torsion angles, the use of multipole moments did not improve the accuracy of the predicted crystal structures for flexible molecules. They did, however, greatly improve the prediction of rigid molecule crystal structure, where the bonded terms are of less importance.

Table 1 The number of successful predictions of the crystal structure of the 21 molecules used in the CSP blind tests using different electrostatic models (point charges, multipole moments or other). The number of methods corresponds to the maximum number of groups using an electrostatic method in the above test. This number may vary between structures within a blind test as not all groups attempted to predict all structures. The numbers outside of parentheses are the number of successful predictions within the top three structures provided by a method, and the numbers within parentheses are the number of correctly predicted structures outside of the top three structures. Only successful top three results are included in the CSP1999 study

                      Point charge   Multipole   Other
CSP1999
Number of methods     6              2           3
I                     4              0           0
II                    1              0           0
III                   1              0           0
VII                   0              1           0
CSP2001
Number of methods     11             3           5
IV                    1(8)           1(0)        0(0)
V                     3(5)           1(1)        0(1)
VI                    0(4)           0(0)        0(0)
CSP2004
Number of methods     11             3           4
VIII                  1(4)           2(2)        1(1)
IX                    0(5)           1(1)        0(2)
X                     0(4)           0(2)        0(1)
XI                    0(0)           0(2)        0(1)
CSP2007
Number of methods     5              4           5
XII                   0(2)           1(2)        3(0)
XIII                  0(2)           3(1)        1(0)
XIV                   0(1)           2(1)        1(0)
XV                    0(0)           1(1)        1(0)
CSP2010
Number of methods     5              3           6
XVI                   0(2)           1(2)        1(0)
XVII                  0(2)           1(2)        1(0)
XVIII                 0(3)           0(0)        1(0)
XIX                   1(2)           0(1)        1(0)
XX                    0(0)           2(0)        0(1)
XXI (i)               0(0)           0(3)        0(1)
XXI (ii)              0(1)           2(1)        0(1)
XXI (iii)             0(2)           0(3)        0(1)
XXI (iv)              0(1)           0(3)        0(1)

Published on 16 April 2014. Downloaded by University of Cambridge 21/11/2016 20:56:25.

3. Computational efficiency of multipolar electrostatics

3.1 Transferability

The idea of an atom type is inextricably linked with that of transferability. Whilst complex definitions of an atom type have been proposed, this area remains a source of debate and competing methods.126 It is, however, widely regarded as a necessary measure to define electrostatic properties as pre-defined parameters for large scale molecular simulation to be truly viable.

The generation of a transferable set of multipole moments is a far more delicate operation than trying to find a corresponding set of partial charges. Whilst a monopole moment is relatively transferable, higher order multipole moments are less so due to their increasing directional dependencies. The latter make it more difficult to obtain a generic set of higher order multipole moments for a given atom type.

Many molecular and group properties are tractable when attempting to demonstrate transferability. One finds that experimental heats of formation, for example, may be reproduced for a generic hydrocarbon CH3(CH2)xCH3, by fitting to a linear relationship ΔHf = 2A + xB. Here, A and B represent the respective energies of methyl and methylene groups. Indeed, this function is equally applicable to SCF single-point energies for equivalent systems, such that E = 2E(CH3) + xE(CH2) is accurate to approximately 0.06 kcal mol⁻¹.127 Based on this additivity of single point energies, the concept is easily extended to imply the additivity of electron correlation energies. Because the correlation energy is a functional of a group’s electron density, it implies that electronic properties must additionally follow this transferability scheme.128

Armed with this, the demonstration that multipole moments possess129 some amenability to atom typing should follow. In one case study,130 a set of small molecules composed of the functional groups present in proteins underwent DMA at HF/3-21G level, and the multipole moments of each atom were assessed. Atom typing by atomic number or hybridisation state was seen to be ineffective, but atom typing by bonding to specific functional groups proved to be more successful. Two transferable schemes were developed: ATOM and PEPTIDE. The former utilised the average multipole moments for specific atom types generated from the data set mentioned previously. The PEPTIDE model features a single multipole moment expansion centre for each distinct amino acid. As such, the local environment for each of these expansion centres is conserved for a given amino acid. The usage of ATOM resulted in substantial deviations from the ab initio electrostatic potential while the PEPTIDE model gave far more satisfactory results. Extending from this, a grossly distorted cyclic undecapeptide (a derivative of the immunosuppressive cyclosporine) was analysed by the above two models. The authors compared the electrostatic potentials generated by these models with one generated from DMA. Again, the PEPTIDE model exhibited lower average errors than ATOM.

Many years later, Mooij et al.131 focused on the generation of an intermolecular potential function implemented in dimers and trimers of methanol. Using a fitted electrostatic term in the intermolecular potential resulted in relatively favourable results: 0.2 kcal mol⁻¹ and 1.6 kcal mol⁻¹ deviations in the dimer and trimer energies, respectively, from counterpoise-corrected MP2/6-311+G(2d,2p) calculations. This is still more impressive than similar studies on other less complex systems that have attempted to parameterise point-charge electrostatics.132


Mooij et al.131 also worked on a methanol–water and a methane–dimethylether complex. Each of these molecules was assigned a set of atom-centred multipole moments. Equally impressive results were obtained, with all interaction energies replicated to within ~0.2 kcal mol⁻¹ of the corresponding ab initio calculations. As such, it was concluded that atom-centred multipole moment expansions are indeed transferable between the same molecules in differing environments.

There are several ways of allocating molecular electronic charge to atoms (e.g. DMA,133 CAMM134 or QCT partitioning135). Considering our group’s research interests, we focus here on QCT-based techniques. Focusing on energy, Bader and Beddall136 demonstrated that:

(1) The total energy of a molecule is given by a sum over the constituent atomic energies.

(2) If the distribution of charge for an atom is identical in two different systems, then the atom will contribute identical amounts to the total energy in both systems.

Although these conclusions are given in terms of energy, they hold for any property density of an electronic distribution over an atomic basin. In light of this fact, it was shown by Laidig137 that multipole moments, under certain constraints, adhere to the above conclusions, and so exhibit transferability.

The property of transferable multipole moments was successfully adopted by Breneman and co-workers,138 in a method denoted Transferable Atom Equivalents (TAEs).139 Primarily, a library was generated consisting of atom-based electron density fragments generated from a QCT decomposition of a set of molecules. DMA was subsequently performed on each of these fragments. These TAEs may then be geometrically transformed into a novel system for which the electrostatic potential is required. The fact that atomic property densities are additive in QCT implies that this recombination of TAEs is sufficient to reproduce an electrostatic potential of the system to a quasi-ab initio level of theory. It should, however, be noted that the transferability of atomic basins is approximate, and so this method will necessarily carry a small error.

The efficacy of this methodology was subsequently demonstrated through three “peptide-capped” molecules: alanine, diglycine and triglycine. The analytical electrostatic potentials were computed on 0.002 au isodensity surfaces. Equivalent electrostatic potentials were also generated from TAE-reconstructed systems and Gasteiger point-charges for the extended (open) and α-conformations. The TAE multipole analysis (TAE-MA) reproduced the features of the electrostatic potential generated at ab initio level much better than the Gasteiger point-charges did. A point-charge electrostatic model is unable to accurately predict extremes in electronic features.

In later work carried out in this group, all 20 naturally occurring amino acids and their constituent molecular fragments were rigorously assessed140 using QCT. A set of 760 distinct topological atoms were generated and cluster analysis identified a set of 42 atom types in total (21 for C, 7 for H, 6 for O, 2 for N and 6 for S). The trivially assigned atom types implemented in AMBER were either too fine-grained (e.g. too many atom types for N) or too coarse-grained (e.g. C atom types not diverse enough).

Later, an extensive investigation141 was carried out for atom typing by atomic electrostatic potential rather than atomic multipole moments as in the previous study. A retinal-lysine system was considered, a prominent feature in the mechanism of bacteriorhodopsin. This study focused on the aldehyde and terminal amino groups of retinal and lysine, respectively. The electrostatic potentials generated by these groups occurring in the full system were compared with those of smaller derivatives of the system. The electrostatic potential of lysine surrounding the terminal amino group was relatively conserved for all derivatives in which two (methylenic) carbon atoms were maintained along the amino acid sidechain. However, the aldehyde group of retinal was more responsive to more distant environmental effects.

This work has recently been further developed,142 where the concept of a “horizon sphere” is proposed. This sphere contains all the atoms that a given atom, at the sphere’s centre, “sees” in terms of their polarisation of the electron density on the central atom. An α-helical segment of the protein crambin was studied. The electrostatic energy was probed by consideration of the multipole moment expansion (up to rank l = 4) centred at a Cα. A new set of multipole moments for Cα was calculated for each structure dictated by the growing horizon sphere. The interaction energy between Cα and a set of probe atoms was evaluated, leading to the conclusion that formal convergence of this interaction energy is attained at a horizon sphere radius of ~12 Å. More work is underway to scrutinise the validity and generality of this conclusion.

Crystallographers who strive for the generation of transferable atomic electron densities,143 find qualms with the reconstruction of molecular electron densities from these QCT-derived atomic densities. This is due to the mismatch in interatomic surface topologies between transferred atoms. As such, they believe that it becomes very difficult to generate a continuous electron density from these atomic fragments. Work has therefore been directed towards the generation of pseudoatom databanks that may be utilised to reconstruct experimental electron densities from previously elucidated structures. From this approach follows a natural output in the form of atomic multipole moments. It is important to point out that the aforementioned mismatch can be countered by accepting that the interaction energy between atoms is what ultimately matters, not the perfect construction of gapless sequences of topological atoms. With this premise in mind we have shown that the machine learning method kriging captures,144–146 within reasonable energy error bars, the way a QCT atom changes its shape in response to a change in the positions of the surrounding atoms.

Jelsch et al.,147 for example, demonstrated the capacity of transferring experimental density parameters for small peptides, based upon the Hansen–Coppens formalism,148 and subsequently built a databank of pseudoatoms. The refinement of high resolution X-ray crystallographic data by referral to this databank has been demonstrated.149 A more computationally-orientated route has been developed in parallel to the one above,143 whereby the experimental density parameters for a set of pseudoatoms were derived from ab initio electron densities of tripeptides. This method showed an enhanced amenability to transferability compared to its experimentally-derived counterpart.
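The “horizon sphere” idea described in this section amounts, at its simplest, to collecting every atom within a chosen radius of a central atom. The following is a minimal sketch of that selection step only, assuming Cartesian coordinates in Å; the function name, the atom labels and the coordinates are hypothetical, and the cited work’s actual construction of the growing sphere may well differ.

```python
import math

# Hypothetical illustration: labels and coordinates are invented for this sketch.

def atoms_within_horizon(centre, atoms, radius=12.0):
    """Return the labels of atoms lying within `radius` (in Å) of `centre`."""
    selected = []
    for label, position in atoms:
        # Euclidean distance between the central atom and this atom.
        if math.dist(centre, position) <= radius:
            selected.append(label)
    return selected


centre = (0.0, 0.0, 0.0)                      # position of the central atom
atoms = [
    ("N1", (1.5, 0.0, 0.0)),                  # close neighbour
    ("O7", (0.0, 11.9, 0.0)),                 # just inside a 12 Å sphere
    ("C9", (0.0, 0.0, 12.5)),                 # just outside
    ("H3", (8.0, 8.0, 8.0)),                  # about 13.9 Å away
]
inside = atoms_within_horizon(centre, atoms)
```

Growing the sphere then corresponds to repeating the selection with increasing `radius` and recomputing the central atom’s multipole moments for each resulting fragment.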


More recently this pseudoatom database has been built upon.150 Atom types were defined by grouping atoms with the same connectivity and bonding partners, while the atom type properties were defined by averaging over all constituent “training set” pseudoatoms. Single point calculations were initially carried out on a test set of amino acid derivatives at B3LYP/6-31G** level. The geometry of each species was taken directly from the Cambridge Structural Database (CSD).151 Multipole moment expansions for non-hydrogen atoms were truncated at ranks l = 4 and l = 2 for the hydrogens. These multipole moments were subsequently averaged and standard deviations defined for the dataset. In terms of performance, the databank model appears to give a slightly more pronounced electrostatic potential surrounding oxygen atoms in carboxylate and hydroxyl groups of Ser, Leu and Gln, compared to the more extensive ab initio calculations. Further work showed that the databank does relatively well in the prediction of most atomic multipole moments. Exceptions take the form of higher order multipole moments, most particularly for oxygens and nitrogens. Finally, we mention that somewhat poorer results are obtained when considering total intermolecular electrostatic energies in dimers. The errors are of the same magnitude as those obtained from AMBER99, CHARMM27 and MM3. The authors ascribe these results to the implementation of a Buckingham-type approximation, whereby non-overlapping electron densities are assumed. This results in the underestimation of short-range interactions, which is in keeping with the sign of ΔE in the above calculations. The authors report much-reduced discrepancies in these energies by use of their own refined method, which accounts for this discrepancy.152

However, in spite of the issues of the reconstruction of crystal structures by use of QCT, the technique remains amenable to the elucidation of electrostatic properties. Woińska and Dominiak153 have given a thorough elaboration on the transferability of atomic multipole moments based on various density partitions, most notably directly comparing the Hansen–Coppens formalism to both QCT and Hirshfeld partitioning. In their study, multipole moments (up to l = 4) were assigned to each atom in a set of biomolecular constructs, ranging from single amino acids to tripeptides. Atom types were subsequently defined from this molecule set based on criteria resembling those used by a similar study.154 By averaging the multipole moments for given atom types in differing chemical environments, a standard deviation from this average value was obtained. A lower standard deviation is indicative of a high degree of transferability, and vice versa. A QCT analysis of ab initio wavefunctions results in highly non-transferable lower-order multipole moments (l = 0, 1, 2). Secondly, the higher-order Hansen–Coppens multipole moments (l = 3, 4) are particularly unstable. The atom types found to be non-transferable from the QCT analysis are generally carbons connected to two electronegative atoms (oxygen or nitrogen), or members of aromatic systems. The decline in the level of transferability for higher-order multipole moments for the Hansen–Coppens method is relatively widespread throughout atom types, most prominently carbon and nitrogen. It is, however, strange to note that for both of these points, the poor level of transferability for QCT and Hansen–Coppens pseudoatoms is constrained to specific atom types; for the rest, these techniques generally give rise to the most transferable multipole moments.

Whilst the lower-order QCT multipole moments are largely non-transferable, they tend to be far more stable when derived from crystallographic data. In spite of this, they are still the least transferable multipole moments in the set, with both Hirshfeld partitioning and the Hansen–Coppens pseudoatom formalism yielding somewhat more stable multipole moments. The authors conclude that the most transferable multipole moments result from Hirshfeld partitioning. QCT discretely partitions electron density into distinct basins and so is vulnerable to numerical issues when undertaking mathematical operations such as integration over the basin. The Hansen–Coppens formalism, on the other hand, suffers from problems with localisation: distant electron density may be incorrectly assigned to a given nucleus. However, an exhaustive study of standard deviations from average multipole moments for given partitioning methods does little to confirm the dominance of one scheme over another. Transferability matters little if the multipole moments in question are incorrectly defined; their subsequent variances over a dataset are inconsequential. In fact, the difficulty in assigning transferable multipole moments to given atom types may equally be indicative of poor atom type definition, or the sheer inability to define a transferable atom type in terms of multipole moments with any great stability. We make a final note that the atom types defined in this work have been tailored for pseudoatom usage,154–156 and so may not be useable with a discrete partitioning scheme (QCT), relative to the so-called ‘fuzzy’ decompositions (Hirshfeld and Hansen–Coppens). In fact, this is concomitant with an analysis of dimer energies obtained from these three techniques.157 When one uses multipole moments obtained by a QCT decomposition as opposed to a Hirshfeld partitioning, the electrostatic energy obtained more closely resembles that obtained from a Morokuma–Ziegler energy decomposition scheme, by as little as 10%.

3.2 Simulation

A recent tour de force regarding the feasibility of biomolecular simulation has seen the computation of time trajectories of systems such as the ribosome.158 However, this study implemented techniques not optimised for the output of particularly accurate results, using a highly parallelisable Charm++ interface in conjunction with the AMBER force field and the NAMD molecular dynamics package,159 i.e. a partial charge approximation.

Biomolecular simulation requires the implementation of periodic boundary conditions to accurately model the environment in which a system resides. Moreover, the electrostatic energy of a system is slowly convergent. Many solutions to this problem have been proposed over the years.7,160 It should be noted that the interaction involving ‘higher order’ multipole moments (l ≥ 1) is more short-range than that between monopole moments. As such, the problem of slowly convergent long-range interactions is shared by both isotropic point-charges and multipolar electrostatics because the latter encompass point-charges.
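The transferability measure used in the study discussed above (average a multipole moment over all occurrences of an atom type, then inspect the standard deviation) can be sketched as follows. The atom type names and moment values are hypothetical, and the use of the population standard deviation is an assumption of this sketch rather than a detail taken from the cited work.

```python
from statistics import mean, pstdev

# Hypothetical illustration: atom types and moment values are invented.

def transferability_summary(moments_by_type):
    """Map each atom type to (mean, population standard deviation) of one moment component."""
    return {
        atom_type: (mean(values), pstdev(values))
        for atom_type, values in moments_by_type.items()
    }


def rank_by_transferability(moments_by_type):
    """Atom types ordered from most to least transferable (smallest spread first)."""
    summary = transferability_summary(moments_by_type)
    return sorted(summary, key=lambda atom_type: summary[atom_type][1])


# One multipole moment component sampled in three chemical environments per atom type.
moments_by_type = {
    "C_bonded_to_two_O": [0.50, 0.70, 0.90],   # large spread: poorly transferable
    "H_alkyl": [0.10, 0.11, 0.12],             # small spread: highly transferable
}
summary = transferability_summary(moments_by_type)
ranking = rank_by_transferability(moments_by_type)
```

As the text cautions, a small spread only indicates consistency across environments; it says nothing about whether the averaged moment reproduces the correct electrostatics.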


We wish to raise the issue of the conformational dependence of electronic properties. In reality, this problem is not unique to higher order multipole moments. Conventional force fields, which employ partial charges, choose to reside in a pseudo-reality of an invariable electrostatic representation, and so rarely encounter conformational dependencies. It is rather more difficult to simply ignore the obvious reality of molecules as flexible entities when using multipole moments, and this has been emphasised in analyses using both DMA and CAMM algorithms.161,162 Use of the electrostatic properties for one conformation correctly reproduces its corresponding electrostatic potential. However, use of this parameterisation in an alternative conformation results in highly unfavourable energies. It should be noted that this insufficiency is equally prominent when using a conserved set of partial charges between conformers. As such, since the issue of flexible molecules is a computational complexity that pervades all electrostatic approximations, it would be unfair to regard this as an additional burden specifically for multipole moments. Instead, it is a hurdle that both techniques are required to overcome in enhancing the accuracy of simulation.

Evaluating multipole moments as a function of a conformational parameter is appealing. For example, the multipole moments for both atoms in CO may be described analytically as a function of the interatomic distance in the molecule.163 However, scaling this idea up to systems with far more conformational degrees of freedom, such as an amino acid, is an appreciably more difficult task. An “analytical compromise” has been proposed in the past,164 whereby the multipole moments of an atom in ethanol, glycine and acrolein are represented by a Fourier series truncated at third order, whose free variables correspond to the dihedral angles of the molecule.

Instead of analytical methods, machine learning methods can be used to interpolate between a set of multipole moments defined for different molecular conformations.146 As such, one may then predict the multipole moments of an arbitrary conformation that is not present within the initial training set, which corresponds to a true external validation. Fig. 7 demonstrates the errors for (double peptide-capped) histidine obtained in following such a scheme.

Fig. 7 Error (kJ mol⁻¹) in the total electrostatic energies predicted by machine learning (kriging) for a set of 24 local energy minima of capped histidine. The mean energy error of the sum of 328 atom–atom electrostatic energy values in capped histidine is 2.5 kJ mol⁻¹. The maximum error is 11.9 kJ mol⁻¹. One can read off the curve that ~80% (of the 539 test configurations) have an error of less than 1 kcal mol⁻¹ (~4 kJ mol⁻¹). [Source: J. Comput. Chem., 2013, 34, 1850.]

Alternative methods have been developed but the literature on these techniques appears to be relatively sparse. For example, it has been proposed165 that one may average the atomic multipole moments over all conformers that are sampled during a simulation. This has been done for alanine and glycine by shifting the higher order (l = 1) atomic multipole moment expansions to a smaller number of expansion sites distributed throughout the molecule. An additional method, previously tested for energy minimisations of crystal structures, revolves around periodically recalculating the atomic multipole moments for the molecule.166 This proved to give substantially better results than the implementation of then-current methodologies, particularly for systems whose structures are dictated by strong hydrogen bonding.

Forces (and torques) must be calculated for the molecular translational and rotational degrees of freedom to be sampled during the course of a simulation. These may be formulated directly by first and second derivatives of interaction energies, by use of translational and rotational differential operators. If one considers a molecule as a rigid body, the individual atomic multipole moments of the molecule are invariant relative to their stationary local axis systems. As such, the derivative of the interaction energy between two molecular species is satisfied by the derivative of the interaction tensor only. This has been demonstrated in the spherical tensor formalism167,168 and its application (e.g. ref. 169). A simulation package that allows for this rigid-body approximation in conjunction with multipolar electrostatics exists170 and is called DL_MULTI.

The invariance of atomic multipole moments with respect to a local axis system no longer holds when abandoning the rigid-body approximation in favour of a realistic flexible-body protocol. As such, differentiation of the interaction energy function requires the derivatives of multipole moments in addition to the interaction tensor. Whilst this requires a more involved series of calculations, it is still an attainable requirement.171 Somewhat more problematic is the fact that the local axis systems in which the atomic multipole moments are referenced evolve over the course of a simulation. Due to the flexible nature of the molecule, neighbouring atomic positions that make up a local axis system change with respect to time. This results in the subsequent net rotation of the atomic local frames. In the context of QCT and the machine learning method kriging, analytical forces can be calculated for “flexible, multipolar atoms”, although this is not trivial and will be published in the near future.172

Literature quotations of CPU time differences between a molecular dynamics simulation using point-charges versus multipole moments (for a given number of nanoseconds, of a given biomolecule with a given number of water molecules surrounding it), are virtually non-existent. However, Sagui et al.173


reported a representative ratio of 8.5 in favour of point-charge electrostatics (as implemented in AMBER 7) when most of the calculation is moved to the reciprocal space (via the PME method) with multipolar interactions up to hexadecapole–hexadecapole being included. The only way a point-charge model can ever match the accuracy of a nucleus-centred multipolar model is via the introduction of extra off-nuclear point charges. What is rarely stated is that these additional charges create an enormous computational overhead in a typical system of tens of thousands of atoms because charge–charge interactions are longer range (1/r-dependence) than any interaction between multipole moments.

4. Implementation of multipolar electrostatics

4.1 AMOEBA

Arguably one of the most successful next-generation force fields is AMOEBA (Atomic Multipole Optimised Energetics for Biomolecular Applications).52 AMOEBA has been proven effective in a variety of biomolecular simulations, ranging from solvated ion systems77,174,175 to organic molecules176 and peptides.177–179 The electrostatic energy component of the force field is broken down into two terms. The first term is concerned with permanent atomic multipole moments (expansions truncated at l = 2), whilst the second term corresponds to induced multipole moments as a result of polarisation effects. The permanent atomic multipole moments are generated by DMA of a set of small molecules such that atom types may be defined. When implemented during a simulation, these atomic multipole moments may be rotated into various fixed local axis systems within the molecule. AMOEBA proves to be competitive, even with ab initio level calculations. All levels of theory tested perform in a uniform manner: the average ΔE values across all conformations at the MP2/TQ, ωB97/LP, B3LYP/Q and AMOEBA levels of theory are 3.73, 3.15, 3.64 and 3.30 kcal mol⁻¹, respectively. These are impressive values, but one must remain aware that they correspond to total energies as opposed to those arising specifically from the electrostatic component of the force field.

A true demonstration of the benefits corresponding to multipole moments arises from a direct comparison between AMOEBA and the various force fields that employ point-charges. Kaminský and Jensen,180 for example, sampled the number of energetic minima of glycine, alanine, serine and cysteine one recovers at MP2 level. The number of minima and their relative energies were subsequently compared to those recovered by use of AMOEBA and seven point-charge force fields. The results for serine and cysteine are outlined in Table 2, where the ab initio data suggests 39 and 47 minima, respectively. We see that AMOEBA consistently outperforms the large majority of point-charge force fields in terms of the mean absolute deviation (MAD) of energies relative to the MP2 results. AMOEBA additionally outperforms all other force fields in terms of the number of the minima it predicts for each amino acid. Note that the latter result gives rise to an artificially large MAD value relative to the other force fields. Considering the aforementioned more favourable MAD corresponding to AMOEBA, this only emphasises the predictive capacity of AMOEBA.

Another study that has directly compared AMOEBA to a variety of other conventional force fields (AMBER, MM2, MM3, MMFF and OPLS) is that of Rasmussen et al.,181 where relative conformational energies were approximated. A set of minima were generated for several molecules, each with intermediary electrostatic properties ranging from entirely non-polar to zwitterionic. The ability of the various force fields to predict relative energies of the minima was probed, in addition to three separate AMOEBA parameterisation schemes, differing in atom typing or level of theory. All force fields performed extremely well for the non-polar molecules, largely due to the minor electrostatic contribution to the conformational energy of non-polar molecules. As such, the level at which electrostatics were calculated is essentially irrelevant. However, as the molecular species become more polar in nature, the point-charge force fields begin to display their erroneous nature relative to the AMOEBA parameterisations, which demonstrate a more uniform predictive capacity. The zwitterionic species were modelled well by several of the point-charge force fields. This can be explained by the fact that full charges are properly represented by a spherical electrostatic potential as the charge is highly localised. Thus, point-charge implementations of electrostatics can model such a case with relative ease.
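The remark that charge–charge interactions are longer range (1/r-dependence) than any interaction between higher multipole moments follows from the leading-order distance dependence of a multipole–multipole term: an interaction between moments of rank l_A and l_B decays as r^-(l_A + l_B + 1). The sketch below simply tabulates that scaling; the function names are hypothetical, and orientation-dependent prefactors are deliberately ignored.

```python
# Illustrative sketch: only the radial scaling is modelled, not the angular factors.

def decay_exponent(rank_a, rank_b):
    """Exponent n in the r**(-n) dependence of a rank_a-rank_b multipole interaction."""
    return rank_a + rank_b + 1


def relative_magnitude(rank_a, rank_b, r, r_ref=1.0):
    """Interaction magnitude at distance r relative to its value at r_ref."""
    return (r_ref / r) ** decay_exponent(rank_a, rank_b)


# Ranks: 0 = monopole (charge), 1 = dipole, 2 = quadrupole, 4 = hexadecapole.
charge_charge = decay_exponent(0, 0)            # 1/r
dipole_dipole = decay_exponent(1, 1)            # 1/r**3
hexadecapole_pair = decay_exponent(4, 4)        # 1/r**9
```

Doubling the separation therefore halves a charge–charge term but reduces a dipole–dipole term eightfold, which is why extra off-nuclear charges are costly in the long-range part of the calculation.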

Table 2 The number of geometric minima predicted by a variety of force fields for serine and cysteine. Also given are the number of minima that the various force fields predicted but that were not represented in the set of minima generated by ab initio calculations, and the mean absolute deviations (MAD) for the molecular energies at each geometry

Serine
                              AMBER94  MM2    MM3    MMFFs  OPLS_2005  AMBER99  CHARMM27  AMOEBA
Number of correct minima      21       19     15     27     23         21       19        34
Number of erroneous minima    1        2      3      6      3          7        2         7
Percentage erroneous          4.8      10.5   20.0   22.2   13.0       33.3     10.5      20.6
MAD [kJ mol⁻¹]                8.9      10.7   14.0   7.4    4.1        10.9     11.1      4.2

Cysteine
                              AMBER94  MM2    MM3    MMFFs  OPLS_2005  AMBER99  CHARMM27  AMOEBA
Number of correct minima      23       23     21     28     25         29       21        44
Number of erroneous minima    1        1      3      7      3          5        5         1
Percentage erroneous          4.3      4.3    14.3   25.0   12         17.2     23.8      2.3
MAD [kJ mol⁻¹]                6.3      10.0   13.9   5.4    4.6        5.4      6.6       3.1
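The “Percentage erroneous” row of Table 2 appears to be the number of erroneous minima expressed as a percentage of the number of correct minima (for example, 1/21 ≈ 4.8% for AMBER94 on serine). That reading is an inference from the tabulated numbers rather than a definition stated in the text, and the short check below, with hypothetical variable names, confirms only that it reproduces every tabulated percentage to one decimal place.

```python
# Data transcribed from Table 2: (correct minima, erroneous minima, tabulated percentage).
# Force-field order: AMBER94, MM2, MM3, MMFFs, OPLS_2005, AMBER99, CHARMM27, AMOEBA.

def percentage_erroneous(correct, erroneous):
    """Erroneous minima as a percentage of correct minima (assumed definition)."""
    return 100.0 * erroneous / correct


serine = [(21, 1, 4.8), (19, 2, 10.5), (15, 3, 20.0), (27, 6, 22.2),
          (23, 3, 13.0), (21, 7, 33.3), (19, 2, 10.5), (34, 7, 20.6)]
cysteine = [(23, 1, 4.3), (23, 1, 4.3), (21, 3, 14.3), (28, 7, 25.0),
            (25, 3, 12.0), (29, 5, 17.2), (21, 5, 23.8), (44, 1, 2.3)]

# True if the assumed definition reproduces each tabulated value after rounding.
consistent = all(
    abs(round(percentage_erroneous(correct, erroneous), 1) - tabulated) < 1e-9
    for correct, erroneous, tabulated in serine + cysteine
)
```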


The accurate reproduction of the properties of water has long plagued simulation. Accounting for the explicit binding of water molecules, in addition to accurately modelling bulk properties such as the dielectric constant, is necessary if one wishes to simulate solvated systems accurately.

AMOEBA attempts to account for the lack of a universally acceptable water model by specifically parameterising the water molecule.182,183 Much like the generic AMOEBA force field, atomic multipole moment expansions up to l = 2 are generated using DMA. Polarisation is accounted for via induced atomic dipoles, and van der Waals interactions are modelled by a buffered 14–7 LJ potential. To the credit of the developers, this model is continually improved upon and reparameterised. Most recently,184 atomic multipole moments were generated for a water model at MP2 level with various basis sets in order to probe the reproduction of hydration free energies for a set of small molecules. Whilst the aug-cc-pVTZ basis set was found to give the best results, 6-311++G(2d,2p) gave a comparable accuracy at a much lower computational cost, and so is recommended for larger simulations.

A direct comparison between an AMOEBA water model parameterised at MP2/6-311++G(2d,2p) level and a widely used point-charge water model reveals the benefits of atomic multipole moment-parameterised electrostatics. A popular choice for explicit solvation is the TIP3P model, which assigns a single point-charge to each atomic centre and implements a 12–6 LJ function. Since the LJ functions differ between the two models, a "TIP3P-like" model was generated, which used the AMOEBA water model, but removed all multipole moments (static and induced), replacing them with point-charges. TIP3P and TIP3P-like models were shown to be equivalent by comparison of RDFs and bulk simulation properties. Deviations of the computed hydration free energies from experimental benchmarks for a set of small molecules are given in Table 3 for both AMOEBA and TIP3P-like models. AMOEBA outperforms the TIP3P-like model quite spectacularly, with an RMSD (AMOEBA) ~3 times smaller than RMSD (TIP3P-like).

4.2 SIBFA

The Sum of Interactions Between Fragments Ab initio (SIBFA) force field185 is another force field that has gained esteem within the scientific community. In a similar vein to AMOEBA, the electrostatic term of the force field is split into two distinct components: permanent and inducible. However, subtle differences arise in the computation of permanent multipole moments, whereby a DMA protocol is called upon to generate multipole moment expansions (truncated at l = 2) localised to atomic centres and bond barycentres in the procedure developed by Vigné-Maeder and Claverie.186 While AMOEBA localises inducible dipole moments at atomic centres to account for polarisation effects, SIBFA implements the procedure of Garmer and Stevens187 to account for polarisabilities at heteroatom lone pairs and bond barycentres.

Several impressive demonstrations of the performance of SIBFA relative to ab initio calculations are present in the literature. For example,188 dimerisation energies of formamide and glycyl-dipeptide have been established at the SCF/MP2 level of theory, and decomposed into constituent energetic components by use of the Kitaura–Morokuma (KM) procedure. We specifically present energies corresponding to the electrostatic interaction between monomers in Table 4. Equivalent calculations with SIBFA were carried out, with multipole moments parameterised at the SCF/MP2 level of theory using the Gaussian-type basis set derived by Stevens et al.,189 in excellent agreement with the KM results. Analysis of the additional components of the total intermolecular interaction demonstrated the Coulombic portion to be dominant in defining the preference of formamide dimerisation relative to that of glycyl-dipeptide. Decomposition of the Coulombic interaction predicted by SIBFA additionally shows that the purely monopole–monopole interaction predicts the opposite preference. In fact, it is the monopole–dipole and monopole–quadrupole portions that recover the correct electrostatic interaction energy. A similar set of experiments was also performed between cis-NMA and alanyl-dipeptide, resulting in equivalent conclusions (not shown).

Many of the studies for which SIBFA is utilised appear to focus mainly upon the solvation of metal ions190,191 or the interaction energies of metal ions with biomolecular ligands.192,193 It should be noted that this approach is highly extensible to more complex biomolecular systems, such as metalloenzymes. This is demonstrated in a study that characterised the reasoning behind differential binding energies of a variety of ligands to thermolysin.194

Table 4 Dimerisation energies for formamide and glycyl dipeptide systems computed at the SCF level of theory and SIBFA force field (kJ mol⁻¹). Also given are the energies resulting from individual multipole–multipole interaction ranks

Term              Formamide  Glycyl dipeptide  Difference, δ
ECoulomb(SCF)     22.1       14.3              7.8
ECoulomb(SIBFA)   20.9       13.2              7.7
E00(SIBFA)        16.2       20.0              3.8
E10(SIBFA)        1.6        5.8               7.4
E11(SIBFA)        0.5        0.6               1.1
E20(SIBFA)        3.1        1.8               4.9
E21(SIBFA)        0.1        1.0               1.1
E22(SIBFA)        0.5        0.8               1.3
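The rank-by-rank decomposition discussed for SIBFA (monopole–monopole, monopole–dipole, monopole–quadrupole) can be illustrated with a generic multipole expansion. The following is a minimal sketch in atomic units, not the SIBFA implementation: it assumes a single expansion centre at the origin, the traceless (Buckingham) quadrupole convention, and a unit point charge as the probe. All function names and the `LINEAR_QUADRUPOLE` test distribution are hypothetical.

```python
import math


def moments(charges):
    """Total charge, dipole vector and traceless quadrupole tensor.

    `charges` is a list of (q, (x, y, z)) pairs, with positions given
    relative to the expansion centre at the origin.
    """
    Q = sum(q for q, _ in charges)
    mu = [sum(q * r[a] for q, r in charges) for a in range(3)]
    theta = [[0.0] * 3 for _ in range(3)]
    for q, r in charges:
        r2 = sum(c * c for c in r)
        for a in range(3):
            for b in range(3):
                delta = r2 if a == b else 0.0
                theta[a][b] += 0.5 * q * (3.0 * r[a] * r[b] - delta)
    return Q, mu, theta


def potential_by_rank(Q, mu, theta, R):
    """Potential at R split into l = 0, 1 and 2 contributions."""
    d = math.sqrt(sum(c * c for c in R))
    v0 = Q / d
    v1 = sum(mu[a] * R[a] for a in range(3)) / d**3
    v2 = sum(theta[a][b] * R[a] * R[b]
             for a in range(3) for b in range(3)) / d**5
    return v0, v1, v2


def exact_potential(charges, R):
    """Direct Coulomb sum over the underlying point charges."""
    return sum(q / math.dist(R, r) for q, r in charges)


# Hypothetical test distribution: a linear quadrupole with no net charge or dipole.
LINEAR_QUADRUPOLE = [(1.0, (0.0, 0.0, 0.5)),
                     (-2.0, (0.0, 0.0, 0.0)),
                     (1.0, (0.0, 0.0, -0.5))]

Q, mu, theta = moments(LINEAR_QUADRUPOLE)
probe = (6.0, 0.0, 8.0)
v0, v1, v2 = potential_by_rank(Q, mu, theta, probe)
print("rank contributions:", v0, v1, v2)
print("truncated sum:", v0 + v1 + v2)
print("exact Coulomb sum:", exact_potential(LINEAR_QUADRUPOLE, probe))
```

For this distribution a point-charge (l = 0) description at the centre predicts no interaction at all, while the l = 2 term recovers almost the entire exact potential. This is the same qualitative point made by the monopole–monopole versus higher-rank comparison above.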

Table 3 Free energies of hydration for a variety of molecules predicted by use of AMOEBA and TIP3P-like water potentials. Corresponding experimental values are also given. Energies are in kcal mol⁻¹

Source      Ethylbenzene  p-Cresol  Isopropanol  Imidazole  Methylethyl sulfide  Acetic acid  RMSD
AMOEBA      0.73          7.26      5.58         10.11      1.78                 5.69         0.68
TIP3P-like  0.89          10.72     5.29         10.87      2.94                 7.46         1.96
Experiment  0.70          6.10      4.70         9.6        1.50                 6.70
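The two water models compared in Table 3 use different van der Waals functions, a 12–6 LJ function for TIP3P and a buffered 14–7 function for AMOEBA, which is why a "TIP3P-like" model was needed for a like-for-like comparison. A minimal sketch of the two functional forms follows, assuming Halgren's buffered 14–7 expression with its usual buffering constants (0.07 and 0.12). The well depth and minimum-energy distance used in the example are placeholder values, not parameters from either force field.

```python
def lj_12_6(r, epsilon, r_min):
    """12-6 Lennard-Jones energy, written in terms of the minimum-energy distance."""
    rho = r / r_min
    return epsilon * (rho**-12 - 2.0 * rho**-6)


def buffered_14_7(r, epsilon, r_min, delta=0.07, gamma=0.12):
    """Buffered 14-7 energy; delta and gamma are the buffering constants."""
    rho = r / r_min
    repulsive = ((1.0 + delta) / (rho + delta)) ** 7
    attractive = (1.0 + gamma) / (rho**7 + gamma) - 2.0
    return epsilon * repulsive * attractive


# Placeholder parameters, chosen only for illustration.
EPSILON = 0.11   # well depth
R_MIN = 3.4      # separation at the energy minimum

for scale in (0.8, 0.9, 1.0, 1.2, 1.5):
    r = scale * R_MIN
    print(f"r/r_min = {scale:3.1f}  "
          f"12-6: {lj_12_6(r, EPSILON, R_MIN):9.4f}  "
          f"buffered 14-7: {buffered_14_7(r, EPSILON, R_MIN):9.4f}")
```

Both functions reach the same well depth at the same separation, but the buffered form has a softer repulsive wall and remains finite as the separation goes to zero.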

However, this study primarily focused on the effects of polarisation and charge transfer. As such, despite the impressive results, we merely point it out as a demonstration of the power of SIBFA. Instead, we focus upon the lower scale simulations of these systems in which the role of electrostatics is explicitly defined.

The cation Zn²⁺ is the second most common transition metal utilised in biocatalysis, and so the characteristics of its interaction with protein subunits are obviously of importance. Focusing upon the interaction between Zn²⁺ and the simplest amino acid, glycine, Rogalewicz et al.195 characterised two low-energy of the system at MP2/6-31G* level for all non-zinc atoms. The lowest energy isomers correspond to the metal ion interacting with the carboxylate portion of the zwitterionic glycine, whilst those of higher energy are characterised by the metal ion's chelation of the amino nitrogen and carbonyl oxygen of neutral glycine. The ability of SIBFA to reproduce the relative energies of the seven isomers formed was probed by parameterisation of the multipole moments and polarisabilities by two differing approaches. The first of these, SIBFA-1, corresponds to extracting the multipole moments directly from Hartree–Fock wavefunctions of glycine or its corresponding zwitterion in their entireties. The second (SIBFA-2) decomposed glycine into two fragments (methylamine and formic acid for the neutral form, protonated methylamine and formate for the zwitterionic), followed by the generation of multipole moments from HF wavefunctions and subsequent matrix rotation into equilibrium geometries. This latter approach is obviously used to probe the transferability of SIBFA multipole moments. The result that appears to be most prominent is the particularly poor performance of the SIBFA-2 scheme at predicting Coulombic energies. However, this is by virtue of the fact that these energies have been derived from fully charged species. As such, the Coulombic attraction between these fragments is not realistic. Upon integration into a larger system, the intensity of the various multipole moments will decrease significantly due to interaction between the formate moiety and the methylammonium species. Analysis of Etot for this scheme demonstrates the recovery from these overly emphasised multipole moments by compensating through a smaller polarisation energy for these fully charged species. However, it should be pointed out that all relative energies are correctly recovered by both schemes, with the exception of the Coulombic energy of one isomer as predicted by SIBFA-1. Nevertheless, analysis of the total energies reveals that SIBFA performs far better than conventional non-polarisable force fields.

Classical force fields have attempted to cling onto life by giving the illusion of accounting for anisotropic electronic features by the addition of off-centre partial charges. The authors of SIBFA appear to have driven the final nail into the coffin of these classical force fields. These off-centre partial charges are empirically localised, i.e. placed on expected lone pair sites. However, these sites appear directly as a consequence of the implementation of multipolar electrostatics.196 This is most evident when considering halogen bonding. Energy Decomposition Analysis (EDA) was performed on a system of halobenzenes interacting with two possible probes (the divalent cation Mg²⁺ and water) with the Reduced Variational Space Analysis (RVS) and the aug-cc-pVTZ(-f) basis set. Fig. 8 demonstrates the angular preference for the C–X···P interactions for the various halobenzenes (where X = F, Cl, Br or I and P = Mg²⁺, H or O). Analysis of the Coulombic portion of the EDA shows that the cation preferentially interacts with 'out of bond'-axis electronic features in both the chloro- and bromobenzene species as a result of the σ-hole. It is also immediately evident that the multipolar electrostatics implemented in SIBFA directly superimposes on these curves. As such, the effect of

Fig. 8 Energetic profiles as a function of the C–X···P angle for (from top left, clockwise) fluorobenzene-, bromobenzene-, and chlorobenzene-Mg²⁺ systems. Energy from the SIBFA force field is marked in red while Coulombic energy from the Reduced Variational Space (RVS) scheme is in blue. Energetic minima that are not situated at 180° are hallmarks of halogen bonding. [Source: J. Comput. Chem., 2013, 34, 1125.]

the σ-hole can be accounted for by multipole moments, without the need for fictitious empirical partial charges. Further decomposition of the SIBFA electrostatic energy was conducted, whereby it was found that whilst the monopole–monopole interaction favours a simple θ = 180° angular conformation, it is largely due to the monopole–quadrupole that this conformation is favoured. Preference for the quasi-perpendicular conformations is dictated by the monopole–dipole interaction. The authors suggest that for a perfect superposition of the two curves, higher-order multipole moments may be required.

4.3 NEMO

The final model we consider is that of the NonEmpirical MOdel, NEMO,197 used specifically to approximate intermolecular potentials. We only discuss the electrostatic portion of this potential, which is calculated by expanding Hartree–Fock SCF molecular wavefunctions as a sum of atomic natural orbitals. A monopole moment between two atoms is defined by calculation of an overlap integral between the basis functions assigned to the atoms. Doing this for each pairwise interaction of atoms in the system, one generates a "monopole moment matrix", the diagonal elements of which correspond to local atomic monopole moments. Higher-order multipole moments may also be assigned by replacing the aforementioned overlap integral with a corresponding multipole moment integral. A more complete description of this technique is given elsewhere.198 However, for flexible molecules, molecular charge distributions evolve as a function of internal coordinates. The authors overcome this by defining this charge distribution as a function of the native molecular charge distribution and corresponding atomic polarisabilities. Using this scheme, dimer energies and geometries for the four dominant minima on a dimethoxyethane (DME)–water PES were calculated and compared to SCF calculations. The nomenclature for the DME conformations corresponds to the orientation of the three dihedral angles, with a, g and g′ representing antiperiplanar, gauche and gauche′ geometries, respectively. Raman spectroscopy of DME in water predicts an aga > agg′ > aag > aaa order of stability. This diverges from that predicted by NEMO, but the authors point out the non-equivalence of solvated DME and a DME–water complex. This is a valid point since NEMO energies generally agree well with the SCF energies.

More recently, higher level calibration of the NEMO potential was carried out based on CCSD(T) benzene dimer energies.199 The authors found an excellent agreement with experiment for benzene trimer geometries. However, they neglected to compare quantitative data with other computational work, citing their calculations to have been carried out at too high a level of theory for direct comparison with other lower levels of theory. Nevertheless, they reported a general agreement with previous theoretical calculations, without mentioning specifics. More recent work200 has further developed upon the NEMO formalism, and improved the level of theory for parameterisation in addition to reporting improved performance of the model.

5. Conclusions

In general, the electrostatic interaction between atoms cannot be described accurately when using only one partial charge per atom. Nevertheless, the "point-charge paradigm" continues to dominate contemporary molecular simulation, with the vast majority of practitioners either ignoring or being unaware of the scientific repercussions of this paradigm. It is important that the computational community has the will to progress beyond this paradigm, especially because a solution is available: multipolar electrostatics. In fact, this solution has been available for a long time but an irreversible embrace of it remains absent, sadly. Journal editors should also help to overcome this unacceptable status quo. Currently available computing power offers the opportunity to finally make a step change in the modelling of electrostatics at a time when, certainly in the area of biomolecular simulation, the trust of experimentalists towards computational predictions needs to be gained. Errors of a few kilojoules per mole can already be enough to draw the wrong qualitative conclusion from a calculation. How the use of point-charges can then be perpetuated at short and medium range is baffling. We hope that this perspective has made a convincing case by collecting and reporting the evidence against point-charges. The message should be clear but it now remains to be seen if a combination of powerful computers and scientific goodwill will finally realise a long overdue transition to multipolar electrostatics.

Acknowledgements

We thank Dr MS Shaik for his help in the preparation of a very early draft of this manuscript. Gratitude is due to BBSRC for providing a PhD studentship for SC, and we also thank EPSRC for funding a PhD studentship for TJH with sponsorship from AstraZeneca Ltd.

References

1 V. Hanninen, T. Salmi and L. Halonen, J. Phys. Chem. A, 2009, 113, 7133–7137.
2 M. W. Mahoney and W. L. Jorgensen, J. Chem. Phys., 2000, 112, 8910–8922.
3 A. J. Stone, in The Theory of Intermolecular Forces, ed. J. S. Rowlinson, Clarendon Press, Oxford, 1st edn, 1996.
4 A. Raval, S. Piana, M. P. Eastwood, R. O. Dror and D. E. Shaw, Proteins: Struct., Funct., Bioinf., 2012, 80, 2071–2079.
5 G. B. Arfken and H. J. Weber, Mathematical Methods for Physicists, Academic Press, San Diego, California, USA, 1995.
6 P. L. A. Popelier and A. J. Stone, Mol. Phys., 1994, 82, 411–425.
7 G. A. Cisneros, M. Karttunen, P. Ren and C. Sagui, Chem. Rev., 2014, 114, 779–814.
8 J. P. Ritchie and A. S. Copenhaver, J. Comput. Chem., 1995, 16, 777–789.

9 L. Joubert and P. L. A. Popelier, Phys. Chem. Chem. Phys., 2002, 4, 4353–4359.
10 D. Ghosh, D. Kosenkov, V. Vanovschi, C. F. Williams, J. M. Herbert, M. S. Gordon, M. S. Schmidt and L. V. Slipchenko, J. Phys. Chem. A, 2010, 114, 12739–12754.
11 B. Guillot, J. Mol. Liq., 2002, 101, 219–260.
12 W. L. Jorgensen and J. Tirado-Rives, Proc. Natl. Acad. Sci. U. S. A., 2005, 102, 6665–6670.
13 F. H. Stillinger and A. Rahman, J. Chem. Phys., 1974, 60, 1545–1557.
14 C. Millot and A. J. Stone, Mol. Phys., 1992, 77, 439–462.
15 A. J. Stone, Chem. Phys. Lett., 1981, 83, 233–239.
16 P. Cieplak, P. Kollman and T. Lybrand, J. Chem. Phys., 1990, 92, 6755–6760.
17 U. Niesar, G. Corongiu, E. Clementi, G. R. Kneller and D. K. Bhattacharya, J. Phys. Chem., 1990, 94, 7949–7956.
18 P. Barnes, J. Finney, J. Nicholas and J. Quinn, Nature, 1979, 282, 459–464.
19 C. Millot, J. C. Soetens, M. T. C. M. Costa, M. P. Hodges and A. J. Stone, J. Phys. Chem. A, 1998, 102, 754–770.
20 M. P. Hodges, A. J. Stone and S. S. Xantheas, J. Phys. Chem. A, 1997, 101, 9163–9168.
21 R. S. Fellers, C. Leforestier, L. B. Braly, M. G. Brown and R. J. Saykally, Science, 1999, 284, 945–948.
22 F. N. Keutsch and R. J. Saykally, Proc. Natl. Acad. Sci. U. S. A., 2001, 98, 10533–10540.
23 S. Y. Liem, P. L. A. Popelier and M. Leslie, Int. J. Quantum Chem., 2004, 99, 685–694.
24 P. L. A. Popelier and É. A. G. Brémond, Int. J. Quantum Chem., 2009, 109, 2542–2553.
25 P. L. A. Popelier and F. M. Aicken, ChemPhysChem, 2003, 4, 824–829.
26 P. L. A. Popelier, in Quantum Chemical Topology: on Bonds, Potentials. Structure, Bonding. Intermolecular Forces and Clusters, ed. D. J. Wales, Springer, Heidelberg, Germany, 2005, pp. 1–56.
27 R. F. W. Bader, Atoms in Molecules. A Quantum Theory, Oxford Univ. Press, Oxford, Great Britain, 1990.
28 P. L. A. Popelier, Atoms in Molecules. An Introduction, Pearson Education, London, Great Britain, 2000.
29 S. Y. Liem and P. L. A. Popelier, J. Chem. Theory Comput., 2008, 4, 353–365.
30 E. R. Batista, S. S. Xantheas and H. Jonsson, J. Chem. Phys., 2000, 112, 3285–3292.
31 O. Engkvist and A. J. Stone, J. Chem. Phys., 2000, 112, 6827–6833.
32 M. C. Foster and G. E. Ewing, J. Chem. Phys., 2000, 112, 6817–6826.
33 R. Taylor, O. Kennard and W. Versichel, J. Am. Chem. Soc., 1983, 105, 5761–5766.
34 J. P. M. Lommerse, S. L. Price and R. Taylor, J. Comput. Chem., 1997, 18, 757–774.
35 I. Nobeli, S. L. Price, J. P. M. Lommerse and R. Taylor, J. Comput. Chem., 1997, 18, 2060–2074.
36 H. Umeyama and K. Morokuma, J. Am. Chem. Soc., 1977, 99, 1316–1332.
37 P. A. Kollman, Acc. Chem. Res., 1977, 10, 365–371.
38 G. J. B. Hurst, P. W. Fowler, A. J. Stone and A. D. Buckingham, Int. J. Quantum Chem., 1986, 29, 1223–1239.
39 A. P. L. Rendel, G. B. Bacskay and N. S. Hush, Chem. Phys. Lett., 1985, 117, 400–413.
40 T. W. Rowlands and K. Somasundram, Chem. Phys. Lett., 1987, 135, 549–552.
41 P. Hobza and C. Sandorfy, J. Am. Chem. Soc., 1987, 109, 1302–1307.
42 G. Alagona and A. Tani, J. Chem. Phys., 1981, 74, 3980–3988.
43 A. D. Buckingham and P. W. Fowler, J. Chem. Phys., 1983, 79, 6426–6428.
44 A. D. Buckingham, P. W. Fowler and J. M. Hutson, Chem. Rev., 1988, 88, 963–988.
45 A. D. Buckingham and P. W. Fowler, Can. J. Chem., 1985, 63, 2018–2025.
46 J.-H. Lii and N. L. Allinger, J. Phys. Org. Chem., 1994, 7, 591–609.
47 J.-H. Lii and N. L. Allinger, J. Comput. Chem., 1998, 19, 1001–1016.
48 P. Cieplak, J. Caldwell and P. Kollman, J. Comput. Chem., 2001, 22, 1048–1057.
49 J. Kong and J.-M. Yan, Int. J. Quantum Chem., 1993, 46, 239–255.
50 M. S. Shaik, M. Devereux and P. L. A. Popelier, Mol. Phys., 2008, 106, 1495–1510.
51 P. Ren, C. Wu and J. W. Ponder, J. Chem. Theory Comput., 2011, 7, 3143–3161.
52 J. W. Ponder, C. Wu, V. S. Pande, J. D. Chodera, M. J. Schnieders, I. Haque, D. L. Mobley, D. S. Lambrecht, R. A. J. DiStasio, M. Head-Gordon, G. N. I. Clark, M. E. Johnson and T. Head-Gordon, J. Phys. Chem. B, 2010, 114, 2549–2564.
53 R. J. Wheatley and S. L. Price, Mol. Phys., 1990, 71, 1381–1404.
54 M. A. Freitag, M. S. Gordon, J. H. Jensen and W. J. Stevens, J. Chem. Phys., 2000, 112, 7300–7306.
55 P. L. A. Popelier and D. S. Kosov, J. Chem. Phys., 2001, 114, 6539–6547.
56 S. Liem and P. L. A. Popelier, J. Chem. Phys., 2003, 119, 4560–4566.
57 M. S. Shaik, S. Y. Liem, Y. Yuan and P. L. A. Popelier, Phys. Chem. Chem. Phys., 2010, 12, 15040–15055.
58 S. Y. Liem, M. S. Shaik and P. L. A. Popelier, J. Phys. Chem. B, 2011, 115, 11389–11398.
59 S. Y. Liem and P. L. A. Popelier, Phys. Chem. Chem. Phys., 2014, 16, 4122–4134.
60 S. K. Panigrahi and G. R. Desiraju, Proteins: Struct., Funct., Bioinf., 2007, 67, 128–141.
61 G. G. R. Desiraju and T. Steiner, The weak hydrogen bond: in structural chemistry and biology, Oxford University Press on Demand, 2001.
62 T. Steiner and G. Koellner, J. Mol. Biol., 2001, 305, 535–557.
63 M. Levitt and M. F. Perutz, J. Mol. Biol., 1988, 201, 751–754.

64 S. Melandri, Phys. Chem. Chem. Phys., 2011, 13, 13901–13911.
65 P. Auffinger, S. Louise-May and E. Westhof, J. Am. Chem. Soc., 1996, 118, 1181–1189.
66 A. Maris, S. Melandri, M. Miazzi and F. Zerbetto, ChemPhysChem, 2008, 9, 1303–1308.
67 N. Ramasubbu, R. Parthasarathy and P. Murray-Rust, J. Am. Chem. Soc., 1986, 108, 4308–4314.
68 S. Tsuzuki, A. Wakisaka, T. Ono and T. Sonoda, Chem. – Eur. J., 2012, 18, 951–960.
69 H. Torii and M. Yoshida, J. Comput. Chem., 2010, 31, 107–116.
70 I. Alkorta, F. Blanco, M. Solimannejad and J. Elguero, J. Phys. Chem. A, 2008, 112, 10856–10863.
71 P. Politzer, J. S. Murray and T. Clark, Phys. Chem. Chem. Phys., 2010, 12, 7748–7757.
72 M. A. A. Ibrahim, J. Comput. Chem., 2011, 32, 2564–2574.
73 Y. Lu, T. Shi, Y. Wang, H. Yang, X. Yan, X. Luo, H. Jiang and W. Zhu, J. Med. Chem., 2009, 52, 2854–2862.
74 A. J. Stone, J. Am. Chem. Soc., 2013, 135, 7005–7009.
75 Y. Duan, C. Wu, S. Chowdhury, M. C. Lee, G. Xiong, W. Zhang, R. Yang, P. Cieplak, R. Luo, T. Lee, J. Caldwell, J. Wang and P. Kollman, J. Comput. Chem., 2003, 24, 1999–2012.
76 D. Jiao, C. King, A. Grossfield, T. A. Darden and P. Ren, J. Phys. Chem. B, 2006, 110, 18553–18559.
77 A. Grossfield, P. Ren and J. W. Ponder, J. Am. Chem. Soc., 2003, 125, 15671–15682.
78 A. Grossfield, J. Chem. Phys., 2005, 122, 024506–024510.
79 G. Chessari, C. A. Hunter, C. M. R. Low, M. J. Packer, J. G. Vinter and C. Zonta, Chem. – Eur. J., 2002, 8, 2860–2867.
80 D. E. Williams and T. L. Starr, Comput. Chem., 1977, 1, 173–177.
81 S. L. Price, Chem. Phys. Lett., 1985, 114, 359–364.
82 C. A. Hunter and J. K. M. Sanders, J. Am. Chem. Soc., 1990, 112, 5525–5534.
83 S. Grimme, Angew. Chem., Int. Ed., 2008, 47, 3430–3434.
84 C. D. Sherrill, Acc. Chem. Res., 2012, 46, 1020–1028.
85 B. K. Mishra and N. Sathyamurthy, J. Phys. Chem. A, 2007, 111, 2139–2147.
86 P. W. Fowler and A. D. Buckingham, Chem. Phys. Lett., 1991, 176, 11–18.
87 S. L. Price and A. J. Stone, J. Chem. Phys., 1987, 86, 2859–2868.
88 U. Koch and E. Egert, J. Comput. Chem., 1995, 16, 937–944.
89 C. A. Hunter and X.-J. Lu, J. Mol. Biol., 1997, 265, 603–619.
90 C. A. Hunter, J. Singh and J. M. Thornton, J. Mol. Biol., 1991, 218, 837–846.
91 C. A. Hunter, Chem. Soc. Rev., 1994, 23, 101–109.
92 G. Hill, G. Forde, N. Hill, W. A. Lester Jr., W. Andrzej Sokalski and J. Leszczynski, Chem. Phys. Lett., 2003, 381, 729–732.
93 M. Tafipolsky and B. Engels, J. Chem. Theory Comput., 2011, 7, 1791–1803.
94 B. Jeziorski, R. Moszynski and K. Szalewicz, Chem. Rev., 1994, 94, 1887–1930.
95 X. Zheng, C. Wu, J. W. Ponder and G. R. Marshall, J. Am. Chem. Soc., 2012, 134, 15970–15978.
96 S. L. Price, Int. Rev. Phys. Chem., 2008, 27, 541–568.
97 J. P. M. Lommerse, W. D. S. Motherwell, H. L. Ammon, J. D. Dunitz, A. Gavezzotti, D. W. M. Hofmann, F. J. J. Leusen, W. T. M. Mooij, S. L. Price, B. Schweizer, M. U. Schmidt, B. P. van Eijck, P. Verwer and D. E. Williams, Acta Crystallogr., Sect. B: Struct. Sci., 2000, 56, 697–714.
98 W. D. S. Motherwell, H. L. Ammon, J. D. Dunitz, A. Dzyanchenko, P. Erk and A. Gavezzotti, et al, Acta Crystallogr., Sect. B: Struct. Sci., 2002, 58, 647–661.
99 G. M. Day, W. D. S. Motherwell, H. L. Ammon, S. X. M. Boerrigter, R. G. Della Valle, E. Venuti, A. Dzyabchenko, J. D. Dunitz, B. Schweizer, B. P. van Eijck, P. Erk, J. C. Facelli, V. E. Bazterra, M. B. Ferraro, D. W. M. Hofmann, F. J. J. Leusen, C. Liang, C. C. Pantelides, P. G. Karamertzanis, S. L. Price, T. C. Lewis, H. Nowell, A. Torrisi, H. A. Scheraga, Y. A. Arnautova, M. U. Schmidt and P. Verwer, Acta Crystallogr., Sect. B: Struct. Sci., 2005, 61, 511–527.
100 G. M. Day, T. G. Cooper, A. J. Cruz-Cabeza, K. E. Hejczyk, H. L. Ammon, S. X. M. Boerrigter, J. S. Tan, R. G. Della Valle, E. Venuti, J. Jose, S. R. Gadre, G. R. Desiraju, T. S. Thakur, B. P. van Eijck, J. C. Facelli, V. E. Bazterra, M. B. Ferraro, D. W. M. Hofmann, M. A. Neumann, F. J. J. Leusen, J. Kendrick, S. L. Price, A. J. Misquitta, P. G. Karamertzanis, G. W. A. Welch, H. A. Scheraga, Y. A. Arnautova, M. U. Schmidt, J. van de Streek, A. K. Wolf and B. Schweizer, Acta Crystallogr., Sect. B: Struct. Sci., 2009, 65, 107–125.
101 P. Verwer and F. J. J. Leusen, in Reviews in Computational Chemistry, ed. K. B. Lipkowitz and D. B. Boyd, New York, Wiley-VCH, 1998, pp. 327–365.
102 H. Karfunkel, F. J. Leusen and R. Gdanitz, J. Comput.-Aided Mater. Des., 1994, 1, 177–185.
103 D. J. Willock, S. L. Price, M. Leslie and C. R. Catlow, J. Comput. Chem., 1995, 16, 628.
104 D. A. Bardwell, C. S. Adjiman, Y. A. Arnautova, E. Bartashevich, S. X. M. Boerrigter, D. E. Braun, A. J. Cruz-Cabeza, G. M. Day, R. G. Della Valle, G. R. Desiraju, B. P. van Eijck, J. C. Facelli, M. B. Ferraro, D. Grillo, M. Habgood, D. W. M. Hofmann, F. Hofmann, K. V. J. Jose, P. G. Karamertzanis, A. V. Kazantsev, J. Kendrick, L. N. Kuleshova, F. J. J. Leusen, A. V. Maleev, A. J. Misquitta, S. Mohamed, R. J. Needs, M. A. Neumann, D. Nikylov, A. M. Orendt, R. Pal, C. C. Pantelides, C. J. Pickard, L. S. Price, S. L. Price, H. A. Scheraga, J. van de Streek, T. S. Thakur, S. Tiwari, E. Venuti and I. K. Zhitkov, Acta Crystallogr., Sect. B: Struct. Sci., 2011, 67, 535–551.
105 S. L. Price, M. Leslie, G. W. A. Welch, M. Habgood, L. S. Price, P. G. Karamertzanis and G. M. Day, Phys. Chem. Chem. Phys., 2010, 12, 8478–8490.
106 G. M. Day, W. D. S. Motherwell and W. Jones, Phys. Chem. Chem. Phys., 2007, 9, 1693–1704.

107 T. G. Cooper, K. E. Hejczyk, W. Jones and G. M. Day, J. Chem. Theory Comput., 2008, 4, 1795–1805.
108 B. P. v. Eijck, J. Comput. Chem., 2002, 23, 805–815.
109 A. V. Kazantsev, P. G. Karamertzanis, C. S. Adjiman, C. C. Pantelides, S. L. Price, P. T. A. Galek, G. M. Day and A. J. Cruz-Cabeza, Int. J. Pharm., 2011, 218, 168–178.
110 M. A. Neumann and M. A. Perrin, J. Phys. Chem. B, 2005, 109, 15531–15541.
111 M. A. Neumann, J. Phys. Chem. B, 2008, 112, 9810–9829.
112 W. T. M. Mooij and F. J. J. Leusen, Phys. Chem. Chem. Phys., 2001, 3, 5063–5066.
113 G. M. Day, J. Chisholm, N. Sham, W. D. S. Motherwell and W. Jones, Cryst. Growth Des., 2004, 4, 1327–1340.
114 S. R. Cox, L.-Y. Hsu and D. E. Williams, Acta Crystallogr., Sect. A: Found. Crystallogr., 1981, 37, 293–301.
115 D. E. Williams and S. R. Cox, Acta Crystallogr., Sect. B: Struct. Sci., 1984, 40, 404–417.
116 D. E. Williams, J. Mol. Struct., 1999, 485–486, 321–347.
117 D. E. Williams, J. Comput. Chem., 2001, 22, 1–20.
118 D. E. Williams, J. Comput. Chem., 2001, 22, 1154–1166.
119 S. L. Mayo, B. D. Olafson and W. A. Goddard, J. Phys. Chem., 1990, 94, 8897–8909.
120 M. J. Hwang, T. P. Stockfisch and A. T. Hagler, J. Am. Chem. Soc., 1994, 116, 2515.
121 J. R. Maple, M. J. Hwang, T. P. Stockfisch, U. Dinur, M. Waldman, C. S. Ewig and A. T. Hagler, J. Comput. Chem., 1994, 15, 162–182.
122 Z. W. Peng, C. S. Ewig, M. J. Hwang, M. Waldman and A. T. Hagler, J. Phys. Chem. A, 1997, 101, 7243–7252.
123 H. Sun, J. Phys. Chem. B, 1998, 102, 7338–7364.
124 V. Marcon and G. Raos, J. Phys. Chem. B, 2004, 108, 18053–18064.
125 S. Brodersen, S. Wilke, F. J. J. Leusen and G. Engel, Phys. Chem. Chem. Phys., 2003, 5, 4923–4931.
126 J. D. Yesselman, D. J. Price, J. L. Knight and C. L. Brooks, J. Comput. Chem., 2012, 33, 189–202.
127 R. F. W. Bader, A. Larouche, C. Gatti, M. T. Carroll, P. J. MacDougall and K. B. Wiberg, J. Chem. Phys., 1987, 87, 1142–1152.
128 R. F. W. Bader, T. A. Keith, K. M. Gough and K. E. Laidig, Mol. Phys., 1992, 75, 1167–1189.
129 S. L. Price, C. H. Faerman and C. W. Murray, J. Comput. Chem., 1991, 12, 1187–1197.
130 C. H. Faerman and S. L. Price, J. Am. Chem. Soc., 1990, 112, 4915–4926.
131 W. T. M. Mooij, F. B. van Duijneveldt, J. G. C. M. van Duijneveldt-vande Rijdt and B. P. van Eijck, J. Phys. Chem. A, 1999, 103, 9872.
132 R. Kumar, F.-F. Wang, G. R. Jenness and K. D. Jordan, J. Chem. Phys., 2010, 132, 014309–014312.
133 A. J. Stone and M. Alderton, Mol. Phys., 1985, 56, 1047–1064.
134 W. A. Sokalski and R. A. Poirier, Chem. Phys. Lett., 1983, 98, 86–92.
135 D. S. Kosov and P. L. A. Popelier, J. Chem. Phys., 2000, 113, 3969–3974.
136 R. F. W. Bader and P. M. Beddall, J. Chem. Phys., 1972, 56, 3320–3329.
137 K. E. Laidig, J. Phys. Chem., 1993, 97, 12760–12767.
138 C. E. Whitehead, C. M. Breneman, N. Sukumar and M. D. Ryan, J. Comput. Chem., 2003, 24, 512–529.
139 C. M. Breneman, T. R. Thompson, M. Rhem and M. Dung, Comput. Chem., 1995, 19, 161–179.
140 P. L. A. Popelier and F. M. Aicken, ChemPhysChem, 2003, 4, 824–829.
141 P. L. A. Popelier, M. Devereux and M. Rafat, Acta Crystallogr., Sect. A: Found. Crystallogr., 2004, 60, 427–433.
142 Y. Yuan, M. J. L. Mills and P. L. A. Popelier, J. Comput. Chem., 2014, 35, 343–359.
143 T. Koritsanszky, A. Volkov and P. Coppens, Acta Crystallogr., Sect. A: Found. Crystallogr., 2002, 58, 464–472.
144 M. J. L. Mills and P. L. A. Popelier, Comput. Theor. Chem., 2011, 975, 42–51.
145 M. J. L. Mills and P. L. A. Popelier, Theor. Chem. Acc., 2012, 131, 1137–1153.
146 S. M. Kandathil, T. L. Fletcher, Y. Yuan, J. Knowles and P. L. A. Popelier, J. Comput. Chem., 2013, 34, 1850–1861.
147 C. Jelsch, V. Pichon-Pesme, C. Lecomte and A. Aubry, Acta Crystallogr., Sect. D: Biol. Crystallogr., 1998, 54, 1306–1318.
148 N. K. Hansen and P. Coppens, Acta Crystallogr., Sect. A: Found. Crystallogr., 1978, 34, 909–921.
149 N. Muzet, B. Guillot, C. Jelsch, E. Howard and C. Lecomte, Proc. Natl. Acad. Sci. U. S. A., 2003, 100, 8742–8747.
150 A. Volkov, T. Koritsanszky and P. Coppens, Chem. Phys. Lett., 2004, 391, 170–175.
151 F. Allen, Acta Crystallogr., Sect. B: Struct. Sci., 2002, 58, 380–388.
152 A. Volkov, X. Li, T. Koritsanszky and P. Coppens, J. Phys. Chem. A, 2004, 108, 4283–4300.
153 M. Woińska and P. M. Dominiak, J. Phys. Chem. A, 2011, 117, 1535–1547.
154 P. M. Dominiak, A. Volkov, X. Li, M. Messerschmidt and P. Coppens, J. Chem. Theory Comput., 2007, 3, 232–247.
155 V. Pichon-Pesme, C. Lecomte and H. Lachekar, J. Phys. Chem., 1995, 99, 6242–6250.
156 K. N. Jarzembska and P. M. Dominiak, Acta Crystallogr., Sect. A: Found. Crystallogr., 2012, 68, 139–147.
157 P. Dominiak, E. Espinosa and J. Ángyán, Intermolecular Interaction Energies from Experimental Charge Density Studies, in Modern Charge-Density Analysis, ed. C. Gatti and P. Macchi, Springer, Netherlands, 2012, pp. 387–433.
158 K. Y. Sanbonmatsu and C. S. Tung, J. Struct. Biol., 2007, 157, 470–480.
159 L. Kalé, R. Skeel, M. Bhandarkar, R. Brunner, A. Gursoy, N. Krawetz, J. Phillips, A. Shinozaki, K. Varadarajan and K. Schulten, J. Comput. Phys., 1999, 151, 283–312.
160 A. Y. Toukmaji and J. A. Board Jr., Comput. Phys. Commun., 1996, 95, 73–92.
161 S. L. Price and A. J. Stone, J. Chem. Soc., Faraday Trans., 1992, 88, 1755–1763.
162 P. Kędzierski and W. A. Sokalski, J. Comput. Chem., 2001, 22, 1082–1097.




Prediction of Conformationally Dependent Atomic Multipole Moments in Carbohydrates

Salvatore Cardamone[a,b] and Paul L. A. Popelier*[a,b]

[a] S. Cardamone, P. L. A. Popelier, Manchester Institute of Biotechnology (MIB), 131 Princess Street, Manchester, M1 7DN, Great Britain. E-mail: [email protected]
[b] S. Cardamone, P. L. A. Popelier, School of Chemistry, University of Manchester, Oxford Road, Manchester, M13 9PL, Great Britain. E-mail: [email protected]

Journal of Computational Chemistry 2015, 36, 2361–2373. DOI: 10.1002/jcc.24215

This is an open access article under the terms of the Creative Commons Attribution, which permits use, distribution and reproduction in any medium, provided the original work is properly cited and is not used for commercial purposes. © 2015 The Authors. Journal of Computational Chemistry Published by Wiley Periodicals, Inc.

The conformational flexibility of carbohydrates is challenging within the field of computational chemistry. This flexibility causes the electron density to change, which leads to fluctuating atomic multipole moments. Quantum Chemical Topology (QCT) allows for the partitioning of an "atom in a molecule," thus localizing electron density to finite atomic domains, which permits the unambiguous evaluation of atomic multipole moments. By selecting an ensemble of physically realistic conformers of a chemical system, one evaluates the various multipole moments at defined points in configuration space. The subsequent implementation of the machine learning method kriging delivers the evaluation of an analytical function, which smoothly interpolates between these points. This allows for the prediction of atomic multipole moments at new points in conformational space, not trained for but within prediction range. In this work, we demonstrate that the carbohydrates erythrose and threose are amenable to the above methodology. We investigate how kriging models respond when the training ensemble incorporates multiple energy minima and their environment in conformational space. Additionally, we evaluate the gains in predictive capacity of our models as the size of the training ensemble increases. We believe this approach to be entirely novel within the field of carbohydrates. For a modest training set size of 600, more than 90% of the external test configurations have an error in the total (predicted) electrostatic energy (relative to ab initio) of at most 1 kJ mol$^{-1}$ for open chains, and just over 90% have an error of at most 4 kJ mol$^{-1}$ for rings.

Introduction

The computational analysis of biochemical systems is largely biased toward peptides and proteins. One may, therefore, be forgiven for assuming that they possess a near monopoly in biochemistry. However, it is only by complexation with additional molecular species that peptides and proteins are able to accomplish their myriad roles within biological systems.[1] For example, eukaryotic proteins are subject to post-translational modification, a process in which various carbohydrate sequences are attached to the protein. Vital chemical entities, such as enzymatic cofactors (e.g. ATP, NADP, etc.) and nucleotides, are entirely dependent upon the existence of products from the pentose phosphate pathway, which are synthesised from carbohydrates. In fact, the pentose phosphate pathway would be unable to run at all if it were not for the energy derived from carbohydrates, which undergo glycolysis and are subsequently passed into the tricarboxylic acid cycle.

Many biochemical force fields are parameterised by exhaustively sampling quantities arising from peptide atom types. One cannot simply use protein atom types as a direct substitution for carbohydrate atom types for a number of reasons:

1. Peptides possess features that are generally absent in carbohydrates, most prominently the presence of nitrogen and the ability to form structural motifs. Both of these features are particularly perturbative to electrostatic quantities associated with constituent atoms. For example, nitrogen possesses a significant quadrupole moment, which influences other atomic electrostatic quantities anisotropically, and cannot be captured by the standard point charge approximation. Equivalently, structural motifs such as helices and sheets are stabilised by vast intermolecular bonding networks. This dependence on structural motifs necessarily influences the properties of other atoms by constraining them to states that do not necessarily coincide with those of the unfolded non-native state. Although carbohydrates do form structural motifs, they tend to remain flexible under standard biological conditions, and do not typically assemble into the stable secondary structures that polypeptides do.
2. Carbohydrates exhibit much more conformational freedom than peptides. Electronic quantities vary as a function of the conformational degrees of freedom of a molecular species.[2] As such, the electronic quantities of a more flexible molecule will vary to a greater extent than those of a less flexible one.
3. Many carbohydrate species exhibit a preference for axial rather than equatorial arrangements of electron-rich substituents on an anomeric carbon. The origin of this anomeric effect is not entirely clear,[3] but the energetics that arise from it must be captured by a force field which deals with carbohydrates. Similarly, the exo-anomeric effect, which deals with substituents linked to an anomeric oxygen, forces a separate conformational preference. It is not within the scope of this work to deal with the anomeric and exo-anomeric effects in any great detail, and so we refer the interested reader to a more rigorous overview.[4]
4. Similar to the anomeric effect, the gauche effect can arise within a number of carbohydrate species and bias rotamer preferences.[5] This effect has an ambiguous origin. Research has suggested it to arise from hyperconjugation or solvent effects. In short, it is the preference of a gauche rotamer over an anti rotamer, where the latter would be stereoelectronically preferable. Work by Kirschner and Woods[6] proposed that the gauche effect results from solvent effects, which are not important in our work since we are explicitly dealing with gas phase molecules. However, for a carbohydrate force field to be of use, this effect must be accounted for.

The conformational freedom of carbohydrates renders them somewhat troublesome for experimentalists, as they prove to be highly difficult to characterize by conventional high-resolution structural determination techniques,[7] particularly X-ray crystallography. To be precise, carbohydrates tend to be difficult to crystallize, which is problematic because X-ray diffraction techniques are a valuable source of structural information. As such, the structural characterization of carbohydrates rests with a few experimental techniques, and subsequent validation by computational means. This required harmony between experiment and computation is vastly important, and has been recently explored,[8,9] and so proves to be a fruitful avenue for development.

Classical force fields such as OPLS-AA, CHARMm, GROMOS, and AMBER appear to have characterized carbohydrates as "secondary molecular species" relative to their peptide counterparts. As such, these parameterizations resemble "bolt-on" components. However, force fields that are specifically tailored for carbohydrates do exist and have proven successful. GLYCAM[10] is perhaps the most prominent of these force fields, and has been ported to AMBER. More recently, the advent of DL_FIELD has facilitated the use of GLYCAM parameters within DL_POLY 4.0.[11] GLYCAM has undergone extensive validation in an attempt to demonstrate its efficacy. Several studies have focused upon its ability to reproduce conformer populations in explicitly solvated molecular dynamics (MD) simulations,[6,12,13] which is obviously important owing to the massive conformational freedom of carbohydrates. The applicability of GLYCAM to larger, more biologically relevant structures, such as the binding of endotoxin to recognition proteins[14] or the dynamics of lipid bilayers,[15] has also been demonstrated.

GLYCAM has attempted to break the paradigm of deriving partial charges based on a single molecular configuration. Instead, it has been developed such that the partial charges are averaged over the course of an MD simulation, thus (albeit simplistically) accounting for the dynamic nature of electronic properties.[16] However, it must be emphasised that GLYCAM resides within the partial charge approximation to electrostatics, thus severely limiting its predictive capacity. Sugars are particularly amenable to hydration, yet a partial charge approximation to electrostatics cannot recover the directional preferences of hydrogen bond formation without the addition of extra point charges at non-nuclear positions. The isotropic nature of partial charge electrostatics is readily overcome by use of a multipole moment description of electrostatics, which naturally describes anisotropic electronic features such as lone pairs. The benefits of such a multipole moment description over its partial charge equivalent have been systematically demonstrated over the past 20 years in many dozens of papers, recently reviewed.[17] These benefits are not necessarily outweighed by computational cost: it is a common misconception that multipole moment implementations are computationally expensive relative to their point charge counterparts. Because point charge electrostatics is long range, decaying as $O(r^{-1})$, whereas higher order multipole interactions decay faster [dipole-dipole interactions, for example, die off as $O(r^{-3})$], this is not strictly true. Point charges require a larger interaction cutoff radius relative to higher order multipole moments, and, therefore, form the bottleneck in electrostatic energy evaluation. Given proper handling (e.g. parallel implementation), the computational overheads associated with multipole moment electrostatics can be managed.

In the remainder of this article, we shall demonstrate a novel means for modelling electrostatics by use of a multipole moment expansion centred upon each atomic nucleus. The techniques we present will inherently capture the conformational dependence of these multipole moments.

Methodology

Atomic partitioning

The development of molecular orbital theory largely caused a decline in chemical understanding of the theoretical description of molecular systems. The fact that each electron occupies a molecular orbital dispersed across the spatial extent of the molecule gave rise to a valid query: why do functional groups impart some property, such as reactivity, to the molecule, when the distribution of electrons throughout the system is essentially no different to the inert species? Surely, there must be some localization of electrons to the functional group, which permits subsequent functionality. If this is not the case, then even the most fundamental chemical concepts, such as those of nucleophiles and electrophiles, have no theoretical grounding. These terms are used to denote a property of an atom in a molecule, which is not recovered by molecular orbital theory. For example, if one assesses butanol by means of molecular orbital theory, the electrons "belonging" to the hydroxyl group are dispersed throughout the entirety of the molecule. If this is truly the case, then it becomes particularly problematic when one attempts to explain why the presence of the functional group imparts reactivity (an electronic phenomenon), when the electrons are not localized.

Bader and co-workers[18] went some way to address this problem by performing an appealing partitioning of three-dimensional space into atomic basins, which pictorially defines an "atom in a molecule," an approach called the Quantum Theory of Atoms in Molecules (QTAIM). The latter is the first segment of a broader approach called Quantum Chemical Topology (QCT),[19–22] which analyzes quantum mechanical functions other than the electron density and its Laplacian. The QTAIM partitioning has been demonstrated to have several advantages compared to other partitioning schemes,[23–25] and to enjoy excellent transferability compared to other schemes.[26] By use of Born's interpretation of quantum mechanics, one generates a physical electron density, $\rho(\mathbf{r})$, from an ab initio wavefunction, $\psi(\mathbf{r})$, obtained entirely from a first-principles calculation. This electron density is then completely partitioned by use of the gradient operator

$$\nabla = \hat{\mathbf{i}}\frac{\partial}{\partial x} + \hat{\mathbf{j}}\frac{\partial}{\partial y} + \hat{\mathbf{k}}\frac{\partial}{\partial z} \tag{1}$$

where $\hat{\mathbf{i}}$, $\hat{\mathbf{j}}$, $\hat{\mathbf{k}}$ are unit vectors along the x, y, and z axes, respectively, to generate the vector $\nabla\rho(\mathbf{r})$. The evaluation of the gradient of a scalar field results in a vector field, the vectors of which are directed along the path of greatest increase in a function. As such, the vectors that define the field $\nabla\rho(\mathbf{r})$ point toward the greatest increase in the scalar field $\rho(\mathbf{r})$. If one were to map $\rho(\mathbf{r})$ by use of a contour plot, such that each contour represented the encasing of a surface with constant electron density, termed an isosurface, then each vector in $\nabla\rho(\mathbf{r})$ would intersect each isosurface orthogonally.[27]

Points in the vector field defined such that $\nabla\rho(\mathbf{r}) = \mathbf{0}$ are termed critical points. Within the scalar field $\rho(\mathbf{r})$ they represent a maximum, minimum or saddle point (mixture of minimum or maximum depending on direction). The identity of each critical point is revealed by assessing the curvature of $\rho(\mathbf{r})$ at each point, achieved by evaluation of the Hessian of $\rho(\mathbf{r})$

$$\mathbf{H}(\rho) = \begin{pmatrix} \dfrac{\partial^2\rho}{\partial x^2} & \dfrac{\partial^2\rho}{\partial x\,\partial y} & \dfrac{\partial^2\rho}{\partial x\,\partial z} \\ \dfrac{\partial^2\rho}{\partial y\,\partial x} & \dfrac{\partial^2\rho}{\partial y^2} & \dfrac{\partial^2\rho}{\partial y\,\partial z} \\ \dfrac{\partial^2\rho}{\partial z\,\partial x} & \dfrac{\partial^2\rho}{\partial z\,\partial y} & \dfrac{\partial^2\rho}{\partial z^2} \end{pmatrix} \tag{2}$$

This Hessian matrix is a real symmetric matrix and hence Hermitian. Therefore, its eigenvalues are real and they express the magnitude of the curvature along each of the principal axes, which are marked by the direction of the corresponding eigenvectors. The nature of the critical point in question is then given by two easily evaluated parameters: the rank ($\omega$) and signature ($\sigma$) of the critical point, where the former is defined as the number of nonzero eigenvalues of the Hessian of $\rho(\mathbf{r})$, and the latter as the sum of the signs of the eigenvalues.

A fundamental result in the topology of $\nabla\rho(\mathbf{r})$, of great importance in the following, is the partitioning of a molecular system into topological atoms. A key feature necessary to achieve this result is the gradient path. An easy way to grasp this concept is to think of a succession of very short gradient vectors, one after the other and constantly changing direction. In the limit of infinitesimally short gradient vectors, one obtains a smooth and (in general) curved path, which is the gradient path. A gradient path always originates at a critical point and terminates at another critical point. Bundles of gradient paths form a topological object depending on the signature of the critical points that the object connects. All possibilities have been exhaustively discussed before[28] but three ubiquitous possibilities are specified as follows: (i) the topological atom is a bundle of gradient paths originating at infinity and terminating at the nucleus, (ii) the bond path (or more generally atomic interaction line) is the set of two gradient paths, each originating at a bond critical point and terminating at a different nucleus, and (iii) the interatomic surface (IAS) is a bundle of gradient paths originating at infinity and terminating at a bond critical point. An interatomic surface obeys the following condition

$$\nabla\rho(\mathbf{r}) \cdot \mathbf{n}(\mathbf{r}) = 0 \quad \forall\, \mathbf{r} \in \mathrm{IAS} \tag{3}$$

where $\mathbf{n}(\mathbf{r})$ is defined as the vector normal to the IAS. By finding all surfaces that obey this condition, the molecule is completely partitioned into topological atoms $\Omega_i$, where the subscript denotes the atomic basin associated with the ith atom in a molecule. All key topological features of $\nabla\rho(\mathbf{r})$ are summarized in Figure 1.

Integration over these atomic basins allows atomic properties $P_f(\Omega)$ to be defined and calculated. The universal formula from which all atomic properties can be calculated is

$$P_f(\Omega) = \int_{\Omega} \mathrm{d}\tau\, f(\mathbf{r}) \tag{4}$$

where integration with respect to $\mathrm{d}\tau$ denotes a triple integration over all three Cartesian coordinates, confined to the atomic volume $\Omega$, and $f(\mathbf{r})$ denotes a property density. For example, if $f(\mathbf{r})$ equals the electron density $\rho(\mathbf{r})$ then the corresponding atomic property is the electronic population of the topological atom. If $f(\mathbf{r}) = 1$, then we obtain the atomic volume, and when $f(\mathbf{r}) = \rho(\mathbf{r})R_{\ell m}(\mathbf{r})$ we obtain the topological atom's multipole moments,[29] where $R_{\ell m}(\mathbf{r})$ is a spherical tensor[30] of rank $\ell$ and component $m$. Others have shown[31] that topological multipole moments agree better with reference electrostatic potentials than CHELPG charges do. A further advantage of QCT is that the finite size and nonoverlapping nature of the topological atoms avoids the penetration effect, which may otherwise appear in the calculation of intermolecular interaction energies.

Kriging

We can only outline kriging here; for more technical details the reader is referred to our work on histidine.[32] In general, a machine learning method is trained to find a mapping between an input and an output. The machine learning method kriging[33–35] can also be seen as an interpolative technique able to predict the value of a function at an arbitrary d-dimensional point, $\mathbf{x}^*$, given the value of the function at n different points, $\{\mathbf{x}^1, \mathbf{x}^2, \ldots, \mathbf{x}^n\}$, in this d-dimensional space.

Figure 1. (left) A contour plot of the electron density in the molecular plane of furan superimposed onto a representative collection of gradient paths. Atoms are represented by black circles, where the gradient paths terminate. Interatomic surfaces are highlighted as solid curves, and contain bond critical points (black squares). A ring critical point (triangle) is also shown in the center of the furan. (right) A 3D representation of the topological atoms in furan, in the same orientation as the left panel. The molecule is capped by the $\rho = 0.0001$ au envelope and the bond critical points are marked in purple.
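
The rank and signature used in the text to identify critical points, such as the bond and ring critical points marked in Figure 1, can be illustrated with a minimal Python sketch. It assumes the three eigenvalues of the Hessian of the electron density are already available; the function name, the tolerance `tol`, and the example eigenvalues are hypothetical choices made for illustration and are not taken from the original work. The labels attached to each signature are the conventional QTAIM names.

```python
from typing import Sequence, Tuple

# Conventional QTAIM names for rank-3 critical points, keyed by signature.
LABELS = {
    -3: "nuclear attractor (local maximum)",
    -1: "bond critical point",
    +1: "ring critical point",
    +3: "cage critical point (local minimum)",
}


def classify_critical_point(
    eigenvalues: Sequence[float], tol: float = 1e-8
) -> Tuple[int, int, str]:
    """Return (rank, signature, label) from the Hessian eigenvalues of rho.

    rank      = number of eigenvalues whose magnitude exceeds tol
    signature = sum of the signs of those nonzero eigenvalues
    """
    nonzero = [ev for ev in eigenvalues if abs(ev) > tol]
    rank = len(nonzero)
    signature = sum(1 if ev > 0 else -1 for ev in nonzero)
    # Only rank-3 points carry one of the four conventional labels.
    label = LABELS[signature] if rank == 3 else "degenerate critical point"
    return rank, signature, label


if __name__ == "__main__":
    # Hypothetical eigenvalues: two negative curvatures and one positive.
    print(classify_critical_point([-1.2, -0.9, 0.4]))
    # Hypothetical eigenvalues: one negative curvature and two positive.
    print(classify_critical_point([-0.1, 0.2, 0.3]))
```

The tolerance guards against treating numerically tiny eigenvalues as nonzero; its value here is an arbitrary assumption and would need tuning for real densities.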

Kriging is well suited to modeling high-dimensional function spaces, and so is particularly stable when considering the conformational space of large molecules. A cornerstone of kriging is that if two input points are very close together in space then their output values are also very close. To put this more formally, consider the example of three points $\mathbf{x}^*$, $\mathbf{x}'$, and $\mathbf{x}''$ such that $\mathbf{x}^*$ is closer to $\mathbf{x}''$ than to $\mathbf{x}'$ within the d-dimensional space, i.e. $|\mathbf{x}^* - \mathbf{x}''| < |\mathbf{x}^* - \mathbf{x}'|$. As a result, the function values $f(\mathbf{x}^*)$ and $f(\mathbf{x}'')$ should be more strongly correlated than the values $f(\mathbf{x}')$ and $f(\mathbf{x}^*)$.

The function values $\{f(\mathbf{x}^1), f(\mathbf{x}^2), \ldots, f(\mathbf{x}^n)\}$ from which $f(\mathbf{x}^*)$ is composed form a basis set that is necessarily only complete if each point within d-dimensional space has been sampled, that is, as $n \to \infty$. Hence, kriging only gives an approximate value for any predicted point $f(\mathbf{x}^*)$, and the accuracy of its prediction increases as $n \to \infty$. An adequate kriging model typically requires at least $10 \times d$ data points, which is well within reach of modern day computational power for high-dimensional functions, and so kriging is a feasible machine learning method for our purposes.

There are several intricacies involved with the evaluation of a kriging model, in particular the manner by which multidimensional functions are dealt with. For a d-dimensional vector in function space, $\mathbf{x}$, each dimension, or feature, is assigned a parameter $\theta_d$, which maps the variance of $f(\mathbf{x})$ with respect to a change in the dth feature. Consider, for example, the function mapped in Figure 2. In relative terms, whilst $f(x, y)$ changes significantly in response to a change in x, it is essentially invariant with respect to a change in y. As such, y is relatively insignificant when assessing the correlation between two points in function space, i.e. $f(x, y)$ is much more dependent on x than on y. As a result, x is assigned a higher "importance" than y, which is reflected in the value of $\theta_x$, where $\theta_x > \theta_y$.

Figure 2. Plot of $f(x, y)$ against the two independent variables. Note how the value of $f(x, y)$ is relatively invariant in y compared to x, which results in $\theta_x > \theta_y$.

Kriging is a kernel method in view of the kernel function at its heart. This function enables kriging to operate in a high-dimensional, implicit feature space without ever computing the coordinates of the data in that space. Instead, only the inner products between the images of all data pairs in feature space need to be computed. The kernel that we evaluate when obtaining a kriging model is a function of the correlation between two points in feature space, $\mathbf{x}^i$ and $\mathbf{x}^j$, such that the correlation between the points, $R(\mathbf{x}^i, \mathbf{x}^j)$, is given by

$$R(\mathbf{x}^i, \mathbf{x}^j) = \exp\left[-\sum_{h=1}^{d} \theta_h \left|x_h^i - x_h^j\right|^{p_h}\right] \tag{5}$$

Brief analysis of this function shows that, if $x_h^i$ and $x_h^j$ are situated closely together for many features h, then the argument of the exponential tends toward zero, leading the correlation between the two points to tend toward one. Note that if the hth feature is relatively unimportant, it will be assigned a low $\theta_h$ value. As a result, the hth term in the sum remains small even if $x_h^i$ and $x_h^j$ are relatively far apart, so the correlation between the two points stays high, which reflects that $f(\mathbf{x}^i)$ and $f(\mathbf{x}^j)$ are similar.
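
As a minimal sketch of the correlation function in eq. (5), the following Python code evaluates the correlation between two points and assembles the symmetric correlation matrix for a small set of points. The function names and all numerical values (the points and the $\theta$ and $p$ parameters) are hypothetical and chosen only to mirror the situation of Figure 2, where the first feature is important and the second is not; they are not values from the original work.

```python
import math
from typing import List, Sequence


def correlation(
    xi: Sequence[float],
    xj: Sequence[float],
    theta: Sequence[float],
    p: Sequence[float],
) -> float:
    """Correlation of eq. (5): exp(-sum_h theta_h * |xi_h - xj_h| ** p_h)."""
    exponent = sum(
        t * abs(a - b) ** ph for a, b, t, ph in zip(xi, xj, theta, p)
    )
    return math.exp(-exponent)


def correlation_matrix(
    points: Sequence[Sequence[float]],
    theta: Sequence[float],
    p: Sequence[float],
) -> List[List[float]]:
    """n x n matrix of pairwise correlations between the given points."""
    return [[correlation(a, b, theta, p) for b in points] for a in points]


if __name__ == "__main__":
    # Hypothetical parameters: feature x is important, feature y is not.
    theta = [2.0, 0.01]
    p = [2.0, 2.0]
    origin = (0.0, 0.0)
    shifted_in_x = (1.0, 0.0)
    shifted_in_y = (0.0, 1.0)
    # A unit shift along the important feature reduces the correlation
    # strongly (exp(-2)), whereas the same shift along the unimportant
    # feature barely changes it (exp(-0.01)).
    print(correlation(origin, shifted_in_x, theta, p))
    print(correlation(origin, shifted_in_y, theta, p))
```

Keeping $\theta$ and $p$ as per-feature sequences follows the text, where each feature carries its own parameters; how those parameters are optimized is outside the scope of this sketch.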

Given a set of n points in feature space, an n3n correlation given training set size. These data are used to construct sepa- matrix, R^ , is defined, whose elements are defined as the cor- rate kriging models for each kriging center. From this, the krig- relation between the ith and jth points (and as such is symmet- ing model is then able to, given an arbitrary point in ric). The task at hand is then to minimize the mean squared conformational space, predict the associated multipole error of prediction of the kriging estimator. It can be shown moments which accompany such a position. This method has that this is equivalent to maximizing the likelihood function L, been developed and tested substantially within our group, which is given by and gives very agreeable results for a number of distinct [32,38] chemical species, but is nevertheless necessarily a subject t 21 1 2ðÞy21l R ðÞy21l of intense ongoing refinement. L5 n n 1 exp 2 ; (6) ðÞ2p 2ðÞr2 2 jRj2 2r where 1 is a column vector of ones, t denotes the transpose, Conformational sampling 2 r is the process variance, and l is a constant term that Here, we present a conformational sampling methodology models the global trend (i.e. “background”) of the column that utilizes the normal modes of a molecular system as a vector y of observations. This formula arises from the defini- means for dynamically evolving the system. This methodology tion of a Gaussian process and is not discussed here. For our has been used before in our lab for amino acids[32,39,40] and purposes, it is more convenient to maximize the natural small molecules[41] but this is the first time we report it in logarithm of this function, which is done analytically (see Sup- great detail. Each normal mode has a corresponding frequency porting Information of Ref. 36) by differentiation with respect that is calculated by diagonalization of the mass-weighted 2 to r and l and setting the respective derivatives to zero. 
Hessian, H, the details of which are incorporated in the Sup- When these optimal values for r2 and l are substituted back porting Information. With these frequencies, a system of equa- into eq. (6), one obtains the “concentrated” log-likelihood tions of motion is obtained, which permits for the function[32],or conformational evolution of the molecular system in time. n 1 These equations take a harmonic form, and are elaborated log L52 log r^2 2 log jR^j (7) 2 2 upon in the Supporting Information. Expressing the system in a basis of internal coordinates where L is the likelihood and r^ is the process variance (a con- results in six of the 3N Cartesian degrees of freedom possess- stant). The parameters h and p, which are d-dimensional vec- ing a frequency of zero (these correspond to the global trans- tors containing the individual feature parameters mentioned lational and rotational degrees of freedom), and so we need previously, must be optimized, which is equivalent to maximiz- only evaluate the 3N–6 “vibrational” equations of motion.[42] ing the “concentrated” likelihood function. From eqs. (5) and We refer to these Nvib53N26 degrees of freedom in the inter- (7) it is clear that log L is a function of h and p. The function nal coordinate basis as “modes” of motion. The transformation log L is the quantity that needs to be maximized, which is from a mass-weighted cartesian coordinates, q, to the set of done by another machine learning method called particle internal coordinates, s, is attained by evaluating the 3N x 3N [37] swarm optimization. We can then make a prediction of the transformation matrix, D, which satisfies output at a new point x with the optimized kriging parame- ters h and p, using the formula s5Dq (9)

Xn Note that this transformation retains the mass-weighting of y^ x 5 l^1 a u x2xi ; (8) ðÞ i q. Construction of D is undertaken by defining six orthogonal i51 vectors corresponding to the global translational and rota- th 21 tional degrees of freedom of the system, as defined by the where ai is the i element of the vector a5RðÞy21l^ where l^ is the (known) maximized mean l, while u x2xi is calcu- Sayvetz conditions. To implement these, the system must be lated[32] via eq. (5). specified in a global reference frame (sometimes termed the In our implementation of machine learning, each atom Eckart frame), the origin and axes of which coincide with the within a molecule is termed a kriging center, with a respective centre of mass and principal axes of inertia, respectively. Suf- multipole expansion (up to the hexadecapole moment) cen- fice to say that these conditions dictate the system possesses tered on the nucleus. Higher rank multipole moments are no net angular momentum relative to the Eckart frame, which highly sensitive to the change in conformation of the molecule rotates with the system. due to a fluctuating electric field. As such, we define each The above leads to the generation a set of six orthogonal multipole moment as a function of the 3N-6 degrees of free- vectors, which are invariant under global translational and dom of the molecular system, pertaining to each kriging rotational motion. These vectors correspond to the first six col- center. umns of D. Since the internal coordinates form a mutually The molecule is distorted by means of energy input into orthogonal basis, the resultant Nvib26 columns are generated each of its normal modes (discussed in section 2.3), and the by means of a Gram-Schmidt orthonormalization procedure, multipole moments of each kriging center elucidated for a whereby the projection of Dj on Di, Pij, is given by

Journal of Computational Chemistry 2015, 36, 2361–2373 2365 FULL PAPER WWW.C-CHEM.ORG

$$P_{ij} = \frac{D_i \cdot D_j}{D_i \cdot D_i} \qquad (10)$$

Note that the columns of $D$ are also normalized by this process. If $D_i$ and $D_j$ are orthogonal, then $P_{ij} = 0$. If $P_{ij} \neq 0$, then $P_{ij}$ is subtracted from $D_j$ and the process is iterated until $P_{ij} = 0$. Generalizing to account for our $N_{vib}$ columns,

$$\begin{aligned} D_7 &= D_7 - \mathbf{1}\sum_{i=1}^{6} P_{i7} \\ D_8 &= D_8 - \mathbf{1}\sum_{i=1}^{7} P_{i8} \\ &\;\;\vdots \\ D_n &= D_n - \mathbf{1}\sum_{i=1}^{n-1} P_{in} \end{aligned} \qquad (11)$$

where $\mathbf{1}$ is a column vector of ones. As before, this procedure is iterated until $P_{ij} = 0\ \forall\, i, j$, which results in the $\{D_n\}$ forming a mutually orthonormal set. In computational terms, the threshold value of the $P_{ij}$ that we require before considering the $\{D_n\}$ to be mutually orthonormal is $\mathcal{O}(10^{-8})$.

The mass-weighted Hessian, $H_q$, outlined in Supporting Information, is transformed into the internal coordinate basis by use of $D$,

$$H_s = D^{T} H_q D \qquad (12)$$

where the subscripts denote the basis in which these quantities are expressed, and $T$ denotes the transpose. To evaluate the frequencies of the various modes of motion, we require diagonalization of $H_s$,

$$E^{-1} H_s E = I\lambda \qquad (13)$$

where the columns of $E$ are the eigenvectors of $H_s$ and $I$ is the identity matrix. In our protocol, this is achieved by tridiagonalizing the Hessian by the Householder algorithm, followed by a QR decomposition of the tridiagonal Hessian,[43] yielding a diagonal Hessian, as required. The resultant eigenvalues, $(I\lambda)_{ii} = \lambda_i$, are related to the mode frequencies, $\nu_i$, by

$$\nu_i = \sqrt{\frac{\lambda_i}{4\pi^2 c^2}} \quad \forall\, i = 1, \ldots, 3N \qquad (14)$$

where $c$ is a factor that incorporates the speed of light and the conversion from atomic units to reciprocal centimeters. Of course, six of these frequencies correspond to the global translational and rotational degrees of freedom of the system and are zero, thus yielding $N_{vib}$ non-zero frequencies. The reduced masses and force constants corresponding to the modes with non-zero frequency are given by similar manipulations of these quantities. The reader is again directed to Ref. [42] for a discussion of their calculation. The amplitude of the $i$th mode, $A_i$, is given by rearrangement of the familiar expression for the energy of a simple harmonic oscillator, $E = k_i A_i^2/2$,

$$A_i = \sqrt{\frac{2E}{k_i}} \qquad (15)$$

where $k_i$ is the force constant of the mode of motion, and $E$ is the energy available to it. We now have all quantities required to evolve the modes of motion and replicate the vibrational dynamics of the system. The total energy available to the system is given by the expression for thermal energy, $E = N_{vib} k_\mathrm{B} T/2$, and is stochastically distributed throughout the modes. The phase factors of the modes, $\phi$, are also randomly assigned: if $\phi = 0$ for all modes, then they oscillate in unison, which corresponds to a photonic single frequency excitation. Instead, we assume the modes to resonate out of phase with one another, as energy transfer to each mode from an external heat bath will be predominantly decoherent.

The sole remaining issue is the choice of a dynamical timestep with which to evolve the various modes of motion. Our choice is based on the desire to ensure a single oscillation of a mode is sampled uniformly, i.e. we do not want to bias our sampling toward specific regions of the periodic function that describes the evolution of the mode. We obtain the time period of the mode as $T_i = 1/\nu_i$, and subsequently ensure that the sampling methodology permits $n_{cycle}$ points to be evaluated along a single oscillation. In this case, the dynamical timestep for the $i$th mode is $\Delta t_i = 1/(\nu_i n_{cycle})$. $n_{cycle}$ is left as a user-defined input, and is set to $n_{cycle} = 10$ in the following work.

Additionally, the distribution of the total energy throughout the modes is considered a dynamic quantity, and so for every $n_{reset}$ samples that are output, the energy is randomly redistributed throughout the system. The phase factors are also redefined at the same frequency. Again, $n_{reset}$ is left as a user-defined parameter, and is set to $n_{reset} = 2$ in the following.

We wish to clarify an issue in order to avoid misinterpretation. The above methodology is not meant as an exact technique for the exploration of the molecular potential energy surface (PES). By truncating the Taylor series of the potential energy at second order, we essentially model the local PES as a harmonic well, which is obviously a simplification. However, we believe the above process to be a computationally efficient means for generating molecular conformers. Moreover, the important alternative method of MD to generate conformers is not necessarily more realistic. Whilst the success of MD is not in question, the validity of the force fields that are currently implemented is not guaranteed.

Computational Details

The workflow proposed below essentially takes an ensemble of configurations as input, and outputs kriging models for the variation in the atomic multipole moments as a function of the configuration of the system:

1. The test system is sampled in accordance with the methods outlined in section 2.3. The general idea is to sample as much of configuration space as would form an ensemble for the true physical system over the course of a dynamical trajectory. This subsequently allows for the formation of a kriging model that will be used in a purely interpolative context.
2. Single-point calculations are performed on each sample and the resultant ab initio electron density is partitioned by QCT software. The multipole moments of the topological atoms are subsequently obtained. This allows a kriging model to evaluate a functional form corresponding to the evolution of the various multipole moments as a function of conformation.
3. The sample set is split into a nonoverlapping training set and test set. The training set is utilised for training of our kriging models, i.e. these are the points that the kriging function must pass through. The test set is not trained for, but is used after the construction of the kriging models to evaluate the errors associated with their predictions.
4. A kriging model is built for each multipole moment of each atom, which allows for the generation of a smooth interpolative function, mapping the evolution of the multipole moment against the conformational parameters of choice. By use of particle swarm optimization, we optimize our kriging parameters, $\{\theta, p\}$, to obtain an optimal kriging model.
5. The kriging models are assessed by making them predict the multipole moments for each atom in a system whose configuration has not been used for training of the kriging model. However, we do possess the ab initio multipole moments for this configuration. As such, we evaluate the energy associated with all 1–5 (i.e. two nuclei separated by four bonds) and higher (1–n, n > 5) order interatomic electrostatic interactions as given by the predicted multipole moments from the kriging models, and the equivalent energy as given by the (exact or original) ab initio atomic multipole moments. We subsequently assess the deviation of the kriging predictions from the ab initio electrostatic energies.

The choice of 1–n (n ≥ 5) interactions over the conventional 1–n (n ≥ 4) serves a twofold purpose: (i) avoiding any potential divergence in the electrostatic energy between two atom-centered multipole moment expansions, (ii) avoiding any issues from a coupling of torsional and electrostatic energetics. In other work from this lab, to be published soon, we show that short-range electrostatic energy 1–n (n < 5) can be satisfactorily kriged. Nonelectrostatic energy contributions can be calculated within the QCT context and again adequately kriged, a result that will be published elsewhere.

Note that the electrostatic energy[38] is the final arbiter in the validation of the kriging models, rather than the atomic multipole moments themselves, which are the kriging observations. The molecular electrostatic energy is calculated by a well-known multipolar expansion[30] involving a multitude of high-rank atomic multipole moments.[41] Two restrictions apply. First, this expansion is truncated to quadrupole-quadrupole (L = 5) and rank-equivalent combinations (dipole-octopole and monopole-hexadecapole). Second, the interatomic contributions to the total molecular electrostatic energy are limited to 1–5 and higher.

Whilst somewhat indirect, the validation through energy rather than multipole moment has a twofold purpose. Primarily, the energy is the quantity that will be used for dynamical simulations, and so is the ultimate descriptor that we wish to evaluate correctly. Second, the alternative would be to assess the predictive capacity of each individual kriging model. For a system with any sizeable number of atoms, where each atom has 25 individual multipole moment kriging models, the data analysis obviously becomes overwhelming. However, this analysis is unnecessary owing to the uniqueness of the Taylor expansion from which the multipole moments arise. Since the electrostatic energy is computed from two such unique series, then if the electrostatic energy is correctly predicted, the multipole moments must also be correct by deduction. Note that this consideration is valid for a single atom-atom interaction.

In order to gauge the models' validity, we plot a graph colloquially termed an "S-curve" owing to its typical sigmoidal shape, although it is really a cumulative distribution function. The S-curve plots the absolute deviation of the energy obtained from the predicted multipole moments from the energy obtained from the ab initio multipole moments, after having evaluated the multipole moment interactions. Put more precisely, the predicted multipole moments form an energy that is subtracted from an energy obtained from the ab initio moments. Then the absolute value of this difference is taken. Hence compensation of errors is not allowed because first the difference is taken and then the absolute value. These energetic deviations are plotted against the percentile of test configurations that fall on or below the given energetic deviation.

Our aim is then twofold: the first is to reduce the upper tail of the sigmoid such that the 100th percentile error is convergent at as low an error as possible. This corresponds to the predictions being uniformly good across the test set with no spurious predicted interactions. Our second aim is to shift the S-curve as far down the abscissa (i.e. to the left) as possible, which ensures the average error associated with our predictions is as low as possible. The first goal is achieved by certifying that the training points used for the construction of the kriging models form the boundaries of configurational space with respect to our sample set. This boundary checking guarantees that the kriging model is being asked to interpolate from training data. Boundary checking has not yet been implemented, but we propose a simple means by which this could be accomplished. The initial geometry from which we start sampling may be approximated as occupying the center of the sampling domain. The Euclidean metric in the (3N-6)-dimensional conformational space decides which sample points form the boundaries of the sampling domain by their distance to the initial geometry in conformational space. The second goal is attained by consistent improvement of the kriging engine, and making sure that the test points are uniformly close to the training data, allowing for efficient interpolation.

Note that we do not necessarily choose test points that are close to the training points. In other words, the training and test sets are constructed independently. By ensuring that the training data is uniformly distributed throughout the sampling domain, the average "distance" between an arbitrary test point and a training point will equal that between some other arbitrary test point and another training point. This guarantees no spurious predictions in under-trained regions of conformational space. Of course, it is prudent to invoke some form of importance sampling, which yields a greater sampling density in more "important" regions of conformational space, but this issue has not been explored.
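The S-curve construction and the proposed (not yet implemented) boundary check described above can be illustrated with a minimal Python sketch. This is not the FEREBUS implementation: the function names `s_curve` and `boundary_indices`, the use of NumPy, and the parameter `n_boundary` (how many of the most distant samples to treat as boundary points) are assumptions made purely for illustration. The inputs are taken to be one total 1–5-and-higher electrostatic energy per test configuration and one feature vector per sample.

```python
import numpy as np

def s_curve(e_pred, e_ref):
    """Return (sorted absolute errors, percentile at or below each error).

    e_pred: energies from kriging-predicted multipole moments, one per test configuration.
    e_ref:  energies from ab initio multipole moments, same ordering.
    """
    # Difference first, then absolute value, so errors cannot cancel.
    err = np.sort(np.abs(np.asarray(e_pred, float) - np.asarray(e_ref, float)))
    # Percentile of test configurations with an error <= err[k].
    pct = 100.0 * np.arange(1, err.size + 1) / err.size
    return err, pct

def boundary_indices(features, x0, n_boundary):
    """Indices of the n_boundary samples farthest (Euclidean) from the seed geometry x0.

    features: (n_samples, n_features) array of conformational coordinates.
    n_boundary is a hypothetical user choice; the text does not fix it.
    """
    d = np.linalg.norm(np.asarray(features, float) - np.asarray(x0, float), axis=1)
    return np.argsort(d)[-n_boundary:]

if __name__ == "__main__":
    err, pct = s_curve([10.5, -2.1, 3.5, 1.0], [10.0, -2.0, 3.5, 0.0])
    print(err, pct, err.mean())
```

Plotting `pct` against `err` on a logarithmic abscissa would give a curve of the kind discussed in the text, and `err.mean()` gives the corresponding mean prediction error. Samples returned by `boundary_indices` would be forced into the training set so that test points are interpolated rather than extrapolated.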


We work on the tetrose diastereomers erythrose and threose, the smallest carbohydrates that adopt open chain and furanose forms. The particular conformations studied are given in Figure 3. Energetic minima were provided by Prof Alkorta, who had previously conducted a PES scan of these species,[44] and are reported more thoroughly elsewhere.[45] All geometries were subsequently optimized by the program GAUSSIAN03[46] at the B3LYP/6-311++G(d,p) level of theory. The Hessians were computed for each geometry, and utilized in the conformational sampling methodology we have described in the previous section, allowing for the output of 2000 geometries for each system. Single point calculations were performed on each sample at the B3LYP/apc-1[47] level of theory. The apc-1 basis set (which is a polarization-consistent (pc) double-ζ plus polarization basis set with diffuse functions) was used for the DFT calculations, since this family of basis sets has been specifically optimized for DFT. The resultant wavefunctions are then passed on to the program AIMAll,[48] which calculates the atomic multipole moments according to QCT. We are only interested in the internal degrees of freedom of the molecular configurations.

Figure 3. Comparison of erythrose and threose in the open chain (topological atoms and molecular graph) and in the ring configurations of α and β furanose (in traditional ball-and-stick representation).

Hence the atomic multipole moments must be expressed in an atomic local frame (ALF) rather than in the global frame. This procedure makes sure that the kriging focuses on the variation of atomic multipole moments within the molecule. Otherwise, when referring to the global frame (rather than the ALF), the three components of an atomic dipole moment, for example, vary upon rigid rotation of the whole molecule. Training for such a variation is useless. The same principle applies to atomic multipole moments of rank ℓ ≥ 2. The details of the atomic local frame chosen for our work are outlined elsewhere.[49] Briefly, the origin of the ALF is the nuclear position of the atom of interest, which is kriged. The heaviest (by atomic number, see Cahn-Ingold-Prelog rules) nucleus that is directly bonded to the atom of interest determines the ALF's x-axis, and the second heaviest determines the xy-plane such that an orthogonal y-axis can be installed. The three atoms involved in the ALF are described by the first three features: the distance between origin and "x-nucleus", the distance between origin and "y-nucleus", and the corresponding subtended angle. The non-ALF atoms are located by features that coincide with the familiar spherical coordinates (r, θ, φ), expressed with respect to the ALF. Kriging of each multipole moment (up to the hexadecapole moment) was performed for each atom by the in-house program FEREBUS 1.4 and models consisting of N training examples were generated. Atom-atom interaction energies of 1–5 and higher were computed in a test set of 200 arbitrary conformations, using the ab initio multipole moments, and compared to the interaction energies from multipole moments predicted by the kriging models. Errors are given in the form of so-called S-curves, which map the percentile of conformations within the test set that are predicted up to a chosen maximum error, which is read off on the abscissa.

Here, we give a brief overview of how the force field we propose could be utilised within the context of MD. We limit the discussion to the evaluation of electrostatic interactions, but work is currently being undertaken within our group to establish the framework for an entire force field,[50] which deviates considerably from the terms arising in a classical force field. The non-electrostatic terms are also obtained via the QCT partitioning of molecular energy, originally derived from work in Ref. 51 and then elaborated in an approach called Interacting Quantum Atoms (IQA).[52] At designated points over the course of an MD simulation, the conformational state of the system is evaluated. At this point, atomic multipole moments, up to the hexadecapole moment, can be extracted from the kriging models, and subsequently utilized for the evaluation of interatomic electrostatic interactions. In this way, we capture the conformational dependence of the atomic multipole moments. Separate kriging models are obtained for the non-electrostatic terms, that is, the intra-atomic energy (both kinetic[53] and potential), the short-range interatomic Coulomb energy not obtained by multipolar expansion, and the interatomic exchange energy.

The issue of coordinate frames needs further clarification because the Cartesian coordinate frame used in MD is not the same as the local coordinate frame (ALF) within which we have evaluated the atomic multipole moments. Prior to invoking a kriging model for the evaluation of multipole moments, the


conformational state of the system must be converted from a Cartesian frame of reference to an ALF. From here, atomic multipole moments can be evaluated corresponding to the current state of the system. Previous work has derived the forces arising from the interactions of atomic multipole moments within the Cartesian frame of reference,[54] which requires the partial derivatives of the multipole moments with respect to the ALF degrees of freedom. These terms have an analytical functional form, and so computation of the forces in the Cartesian frame of reference can be performed explicitly. These forces can subsequently be utilized by the standard MD procedure.

An embryonic workflow has been integrated within the MD package DL_POLY 4.0. Currently, an atomic kriging model is loaded into memory and the relevant data stored, followed by removal of the kriging model from memory. In this way, the dynamic memory requirements have not yet exceeded roughly 200 MB, and are thus well within the capabilities of modern computational resources. Of course, memory management is crucial to the speed of the proposed methodology, and so will require a great deal of fine-tuning. However, much speedup can undoubtedly be accomplished by a number of techniques, e.g. caching of regularly used quantities and parallel implementation.

Finally, it is too early to extensively comment on the computational cost of the current approach. It would be naïve to directly compare the flop count of the current force field with a traditional one without appreciating (i) that extra non-nuclear point charges are needed to match the accuracy of multipole moments and the former propagate over long range, (ii) that multipolar interactions drop off much faster than 1/r, depending on the rank of the interacting multipole moments, which depends on the interacting elements themselves (see extensive testing in the protein crambin[55]), (iii) the efficiency of the multipolar Ewald summation[56] that is being implemented in DL_POLY 4.0, (iv) the dominance of monopolar interactions at long range (the vast majority of interactions) and (v) the outstanding fine-tuning of the kriging models at production mode. The current force field may well be an order of magnitude slower than a traditional force field. This estimate and the fact that the current force field contains electronic information invite one to compare its performance with on-the-fly ab initio calculations instead.

Results and Discussion

Single minimum

The lowest energy conformer for each system was chosen as an input structure for sampling. S-curves were subsequently generated for each of these training sets by the methodology outlined in the previous section. The S-curves are given in Figure 4, and the accompanying mean errors in Table 1.

Figure 4. S-curves corresponding to all erythrose (left) and threose (right) systems studied. The open chain forms are systematically better predicted than the corresponding furanose forms.

Table 1. Mean errors associated with the S-curves given in Figure 4.

Mean error (kJ mol⁻¹)
             Open chain   α-Furanose   β-Furanose
Erythrose    0.27         1.32         1.30
Threose      0.34         0.83         1.47

The first point to notice is that the open chains for both erythrose and threose are modeled by kriging to a significantly better standard than the furanose forms. We may, however, immediately attribute this to the number of 1–5 and higher interactions occurring in these systems. For both open chains, 25 interactions are required to be evaluated for comparison to the energies produced from the ab initio multipole moments. The number of interactions requiring evaluation for the furanose forms comes to 39, which is roughly 1.6 times the number evaluated in the open chain forms. We would subsequently expect a proportional relationship between the number of interactions required for evaluation and the mean error attributed to the kriging model. Whilst we see this to be roughly true when comparing the errors on the threose open chain and α-furanose forms, the errors appear disproportionately higher for the other systems.

Kriging is an interpolative technique, and so is not suited for extrapolation. However, we point out that the kriging engine is still predictive for extrapolation: in this case, the prediction falls to the mean value of the function. Obviously this is not ideal for highly undulatory functions. However, considering how the atomic multipole moments do not fluctuate over vast ranges, the mean will often represent a respectable prediction to the function value. The kriging model can be refined in an iterative fashion, whereby extrapolation points are added to


the training set. This is a commonly used technique in the field of machine learning. Whilst not currently implemented within our methodology, the iterative protocol is a technique which is currently being explored.

Regardless of the problem encountered in the above, we see that it is easily remedied by strategic sampling of conformational space. In fact, these problems are ubiquitous to machine learning techniques, and have been encountered in studies which attempt to implement neural networks to predict a PES.[57] In this field, the problems have been solved to some extent by making the prediction engine issue a warning to the user that a point is being predicted which lies outside of the training set. This proves to be advantageous as the user may then recognize that the point should be included within the training set as it obviously lies within an accessible portion of conformational space. The point can then be included within the training set, and one can generate refined models by undergoing this process iteratively.

Training set dependency

We start by discussing the effects of increasing the training set size on the prediction error for a kriging model. For this purpose, we use the erythrose open chain system owing to its higher conformational flexibility, which we assume amplifies the effects of training set size. Kriging models were generated for this system with training sets ranging from 700 to 1500 sampling points, in increments of 100. The same test set (of 200 points) was reserved for prediction by all models. The S-curves for this are given in Figure 5.

Figure 5. S-curves for erythrose open chain at various training set sizes. Note the progression of the S-curves towards the lower prediction errors as the training set size increases. However, owing to the logarithmic abscissa, this does not correspond to a uniform enhancement of a kriging model given a consistently larger training set size.

As expected, the prediction errors of the S-curves in Figure 5 systematically decrease as the training set size is increased. In other words, the S-curves move to the left with increasing training set size, although this is not true for all parts of the S-curves because they clearly intersect in many places. Overall the uniform increments of 100 in training set size are not matched by equally uniform strides of improvement in S-curve shape and position. An alternative way to gauge the improvement in prediction with increasing training set size is monitoring the average prediction error for each S-curve. This value cannot be read off for an S-curve in Figure 5 but can be easily calculated. Figure 6 plots the average prediction error for each S-curve against increasing training set size: red for "Old FEREBUS" and blue for "New FEREBUS," a development version of our kriging engine, which differs in a number of ways from the "Old FEREBUS." We include both "Old FEREBUS" and "New FEREBUS" data to establish whether any functional forms of average prediction error against increasing training set size are conserved with respect to improvements in the engine. The "Old FEREBUS" data show a plateau in the average error (left pane) at a training set size of about 1200, after an initial decrease in this error. This plateau would be rather problematic, as it implies some maximum efficiency of the kriging engine, beyond which there is no reward for an extension of the training set. However, this is not the case for the New FEREBUS data (right pane).

Learning theory states that for a machine learning method of this type (kriging), the mean prediction error should decrease asymptotically toward zero, with functional form $A + B/n$ or $C + D/\sqrt{n}$, where $n$ is the training set size, and $A$, $B$, $C$, and $D$ are fitted constants. Figure 6 plots these asymptotes, where ($A_{old} = 0.105$; $B_{old} = 348.11$) and ($C_{old} = -0.293$; $D_{old} = 22.06$), as determined by regression analysis against the Old FEREBUS data, each with $R^2$ coefficients of 0.93. Similarly, for the New FEREBUS data, constants of ($A_{new} = 0.097$; $B_{new} = 293.38$) and ($C_{new} = -0.193$; $D_{new} = 18.60$) were obtained. These fitted asymptotes both possess $R^2$ values of 0.98. As such, we conclude that the decay of the mean prediction error of our machine learning method possesses, as yet, inconclusive functional form.

The results in Figure 6 are consistent with the behavior seen in similar interpolation methods[58]: for an infinite training set size, the mean prediction error will asymptote to zero. However, for the methodology to remain computationally feasible, some finite training set size will of course be required. So, for example, we find that for a mean prediction error of 0.3 kJ mol⁻¹, the training set would require about 1450 sample points for either functional form taken as the decay of the prediction error, i.e. $A + B/n$ or $C + D/\sqrt{n}$.

A comment on the nature of the average error is in place here. In principle, the prediction error consists of the sum of the estimation error and the approximation error. From learning theory, one expects the estimation error only to go to zero. The bias–variance decomposition of a learning algorithm's error also contains a quantity called the irreducible error, resulting from noise in the problem itself. This error has been investigated some time ago in the context of tests on kriging of ethanol multipole moments[41] and is caused by the small noise generated by the integration quadrature of the atomic multipole moments. Second, any bias caused by an inherent error in the ab initio method used, compared to the


best method available (e.g. CCSD(T) with a complete basis set), is not relevant in our error considerations. The reason is that we always assess the performance of kriging training against the (inevitably approximate) ab initio method at hand, which we refer to as the source of "the" ab initio data.

Figure 6. Mean prediction errors associated with S-curves for erythrose open chain as the training set size is increased. With the old kriging engine (left), a distinct plateau formed after roughly 1200 training examples, corresponding to no further improvement in the kriging model despite additional training points. However, the new kriging engine (right) appears to avoid premature plateauing, with additional kriging model improvement at higher training set sizes. Regression fits of the FEREBUS errors against training set size, of functional forms $A + B/n$ (blue) and $C + D/\sqrt{n}$ (black), where $n$ is the training set size, are also given.

Multiple minima

The amount of conformational space available to molecular systems reaches levels which are entirely unfeasible for systematic sampling as the number of atoms increases. As such, it becomes all the more prudent to obtain an efficient sampling scheme for our purposes. As we have mentioned, our sampling methodology is limited to local conformational exploration about some given input geometry, since the PES about that point is approximated as a harmonic well. As such, to thoroughly explore conformational space, our methodology requires the usage of a number of such starting geometries. Then, the molecular PES is approximated by a number of harmonic wells. If the input geometries are sufficiently close to one another, the wells will overlap, and the PES may be explored seamlessly. Non-equilibrium normal mode conformational sampling has also been demonstrated in a recent publication.[59] This advance will facilitate a more thorough sampling of these higher energy parts of potential energy surfaces. The validation of this methodology is presented in Part B of Supporting Information.

For the open chain form of erythrose, 174 energetic minima were found by an exhaustive search of conformational space. Figure 7 plots the S-curves obtained for samples which have been generated from different numbers of up to 99 minima. The S-curves display increasingly poor prediction results as the number of starting minima increases. The actual mean errors for these S-curves are summarized in Table 2. This trend has a logical interpretation. As the number of seeding structures increases, the sampled conformational space grows in size. Given a fixed kriging model size, the sampling density therefore decreases. The kriging model then deviates from the true analytical function, and the results from predictions deteriorate.

Figure 7. S-curves depicting the power of a kriging model as more energetic minima are utilized for conformational sampling. As the number of minima utilized increases, the S-curves tend toward higher prediction errors. The kriging models which underlie these S-curves have a fixed training set size of 700.

Table 2. Mean errors corresponding to the S-curves depicted in Figure 7.

Number of minima                      1      20     40     60     80     99
Mean prediction error (kJ mol⁻¹)      0.27   0.85   0.9    1.23   1.30   1.37

Note the "bunching" of prediction error when 60 energetic minima or higher are used as seeds for the conformational sampling.

Of course, thorough sampling of conformational space is an issue for parameterizing any force field, and by no means one that is resultant from our methodology. We may overcome this issue in two ways. The first is the ongoing improvement of our kriging engine to deal with larger training sets comprising more molecular configurations. The second is by undertaking sampling with only a subset of the energetic minima that are available. This is all the more valid an approach if most of the minima are very high in energy relative to the lowest-lying minima. These


regions of conformational space will be accessed very infrequently during the course of a MD simulation, and so may be sampled much more coarsely. This selective sampling is quite readily employed, and has been discussed at length in the literature. For example, Brooks and Karplus[60] found that a comprehensive sampling of conformational space for bovine pancreatic trypsin inhibitor could be achieved by evolving only the lowest frequency normal modes of motion. Needless to say, this is readily accomplished by our sampling methodology.

Conclusion

We have demonstrated that the atomic multipole moments of a set of carbohydrates are amenable to the machine learning technique kriging. Whilst this has been done in the past for a variety of chemical species including naturally occurring amino acids, this is the first foray into the field of glycobiology. Kriging is able to capture the conformational dependence of the multipole moments and make predictions, such that the error in the electrostatic energy relative to that derived from ab initio data is encouraging, given the popular aim is to obtain errors below 4 kJ mol⁻¹. Indeed, the presented methodology is immediately extensible to any term arising in an energetic decomposition of a system. If some quantity is conformationally dependent, then the dependence can be modelled by kriging. As such, an entire force field can be parameterized by the current methodology, reproducing ab initio quantities for use in classical MD. This route is preferable to the computationally intensive approach of ab initio MD.

Acknowledgment

The authors thank the BBSRC for funding a DTP PhD studentship.

Keywords: quantum theory of atoms in molecules, carbohydrates, quantum chemical topology, conformational sampling, kriging, electrostatics, multipole moments

How to cite this article: S. Cardamone, P. L. A. Popelier. J. Comput. Chem. 2015, 36, 2361–2373. DOI: 10.1002/jcc.24215

Additional Supporting Information may be found in the online version of this article.

[1] M. L. DeMarco, R. J. Woods, Glycobiology 2008, 18, 426.
[2] C. H. Faerman, S. L. Price, J. Am. Chem. Soc. 1990, 112, 4915.
[3] E. Juaristi, G. Cuevas, The Anomeric Effect; CRC Press, 1994.
[4] B. Lachele Foley, M. B. Tessier, R. J. Woods, WIREs Comput. Mol. Sci. 2012, 2, 652.
[5] V. R. Rao, Conformation of Carbohydrates; CRC Press, 1998.
[6] K. N. Kirschner, R. J. Woods, Proc. Natl. Acad. Sci. USA 2001, 98, 10541.
[7] E. Fadda, R. J. Woods, Drug Discov. Today 2010, 15, 596.
[8] J. Kaminsky, J. Kapitan, V. Baumruk, L. Bednarova, P. Bour, J. Phys. Chem. A 2009, 113, 3594.
[9] J. Cheeseman, M. S. Shaik, P. L. A. Popelier, E. W. Blanch, J. Am. Chem. Soc. 2011, 133, 4991.
[10] K. N. Kirschner, A. B. Yongye, S. M. Tschampel, J. Gonzalez-Outeirino, C. R. Daniels, B. Lachele Foley, R. J. Woods, J. Comp. Chem. 2008, 29, 622.
[11] I. T. Todorov, W. Smith, K. Trachenko, M. T. Dove, J. Mater. Chem. 2006, 16, 1911.
[12] A. M. Salisburg, A. L. Deline, K. W. Lexa, G. C. Shields, K. N. Kirschner, J. Comput. Chem. 2009, 30, 910.
[13] M. L. DeMarco, R. J. Woods, Glycobiology 2009, 19, 344.
[14] M. L. DeMarco, R. J. Woods, Mol. Immunol. 2011, 49, 124.
[15] M. B. Tessier, M. L. DeMarco, A. B. Yongye, R. J. Woods, Mol. Simul. 2008, 34, 349.
[16] M. Basma, S. Sundara, D. Çalgan, T. Vernali, R. J. Woods, J. Comput. Chem. 2001, 22, 1125.
[17] S. Cardamone, T. J. Hughes, P. L. A. Popelier, Phys. Chem. Chem. Phys. 2014, 16, 10367.
[18] R. F. W. Bader, Atoms in Molecules. A Quantum Theory; Oxford University Press: Oxford, Great Britain, 1990.
[19] P. L. A. Popelier, Quantum Chemical Topology: On Descriptors, Potentials and Fragments. In Drug Design Strategies: Computational Techniques and Applications, L. Banting, T. Clark, Eds.; Roy. Soc. Chem., Great Britain: Cambridge, Vol. 20; 2012; Chapter 6, pp. 120–163.
[20] P. L. A. Popelier, On Quantum Chemical Topology. In Challenges and Advances in Computational Chemistry and Physics dedicated to “Applications of Topological Methods in Molecular Chemistry”; E. Alikhani, R. Chauvin, C. Lepetit, B. Silvi, Eds.; Springer: Germany, 2015.
[21] P. L. A. Popelier, The Quantum Theory of Atoms in Molecules, In The Nature of the Revisited; G. Frenking, S. Shaik, Eds.; Wiley-VCH, Chapter 8, 2014; pp. 271–308.
[22] P. L. A. Popelier, E. A. G. Bremond, Int. J. Quant. Chem. 2009, 109, 2542.
[23] D. T. I. Nakazato, E. L. de Sa, R. L. A. Haiduke, Int. J. Quant. Chem. 2010, 110, 1729.
[24] B. Courcot, A. J. Bridgeman, Int. J. Quant. Chem. 2010, 110, 2155.
[25] S. Saha, R. K. Roy, P. W. Ayers, Int. J. Quant. Chem. 2009, 109, 1790.
[26] T. Verstraelen, P. W. Ayers, V. Van Speybroeck, M. Waroquier, J. Chem. Theory Comput. 2013, 9, 2221.
[27] C. F. Matta, R. J. Boyd, The Quantum Theory of Atoms in Molecules; Wiley, 2007.
[28] N. O. J. Malcolm, P. L. A. Popelier, J. Comp. Chem. 2003, 24, 437.
[29] P. L. A. Popelier, Mol. Phys. 1996, 87, 1169.
[30] P. L. A. Popelier, L. Joubert, D. S. Kosov, J. Phys. Chem. A 2001, 105, 8254.
[31] E. F. F. Rodrigues, E. L. de Sa, R. L. A. Haiduke, Int. J. Quant. Chem. 2008, 108, 2417.
[32] S. M. Kandathil, T. L. Fletcher, Y. Yuan, J. Knowles, P. L. A. Popelier, J. Comput. Chem. 2013, 34, 1850.
[33] D. G. Krige, J. Chem. Metall. Mining Soc. South Africa 1951, 52, 119.
[34] D. R. Jones, J. Global Optim. 2001, 21, 345.
[35] C. E. Rasmussen, C. K. I. Williams, Gaussian Processes for Machine Learning; The MIT Press: Cambridge, USA, 2006.
[36] C. M. Handley, G. I. Hawe, D. B. Kell, P. L. A. Popelier, Phys. Chem. Chem. Phys. 2009, 11, 6365.
[37] J. Kennedy, R. C. Eberhart, Proc. IEEE Int. Conf. on Neural Networks 1995, 4, 1942.
[38] T. J. Hughes, S. M. Kandathil, P. L. A. Popelier, Spectrochimica Acta A 2015, 136, 32.
[39] T. Fletcher, S. J. Davie, P. L. A. Popelier, J. Chem. Theory Comput. 2014, 10, 3708.
[40] M. J. L. Mills, P. L. A. Popelier, Theor. Chem. Acc. 2012, 131, 1137.
[41] M. J. L. Mills, P. L. A. Popelier, Comput. Theor. Chem. 2011, 975, 42.
[42] J. W. Ochterski, Vibrational Analysis in Gaussian, Available at: http://www.gaussian.com/g_whitepap/vib.htm, 1999.
[43] W. H. Press, B. P. Flannery, S. A. Teukolsky, W. T. Vetterling, Numerical Recipes, 2nd ed.; Cambridge University Press: Cambridge, 1992.
[44] I. Alkorta, P. L. A. Popelier, Carbohydr. Res. 2011, 346, 2933.
[45] L. M. Azofra, I. Alkorta, J. Elguero, P. L. A. Popelier, Carbohydr. Res. 2012, 358, 96.
[46] GAUSSIAN03 M. J. Frisch, G. W. Trucks, H. B. Schlegel, G. E. Scuseria, M. A. Robb, J. R. Cheeseman, J. A. J. Montgomery, J. T. Vreven, K. N. Kudin, J. C. Burant, J. M. Millam, S. S. Iyengar, J. Tomasi, V. Barone, B. Mennucci, M. Cossi, G. Scalmani, N. Rega, G. A. Petersson, H. Nakatsuji, M. Hada, M. Ehara, K. Toyota, R. Fukuda, J. Hasegawa, M. Ishida, T. Nakajima, Y. Honda, O. Kitao, H. Nakai, M. Klene, X. Li, J. E. Knox, H. P. Hratchian, J. B. Cross, C. Adamo, J. Jaramillo, R. Gomperts, R. E. Stratmann, O. Yazyev, A. J. Austin, R. Cammi, C. Pomelli, J. W. Ochterski, P. Y. Ayala, K. Morokuma, G. A. Voth, P. Salvador, J. J. Dannenberg, V. G. Zakrzewski, S. Dapprich, A. D. Daniels, M. C. Strain, O. Farkas, D. K. Malick, A. D. Rabuck, K. Raghavachari, J. B. Foresman, J. V. Ortiz, Q. Cui, A. G. Baboul, S. Clifford, J. Cioslowski, B. B. Stefanov, G. Liu, A. Liashenko, P. Piskorz, I. Komaromi, R. L. Martin, D. J. Fox, T. Keith, M. A. Al-Laham, C. Y. Peng, A. Nanayakkara, M. Challacombe, P. M. W. Gill, B. Johnson, W. Chen, M. W. Wong, C. Gonzalez, J. A. Pople, In Gaussian, Inc., Pittsburgh PA, 2003, USA.
[47] F. Jensen, J. Chem. Phys. 2002, 117, 9234.
[48] T. A. Keith, AIMAll. 11.04.03 ed.; aim.tkgristmill.com, 2011.
[49] Y. Yuan, M. J. L. Mills, P. L. A. Popelier, J. Mol. Model. 2014, 20, 2172.
[50] P. L. A. Popelier, Int. J. Quant. Chem. 2015, 115, 1005.
[51] P. L. A. Popelier, D. S. Kosov, J. Chem. Phys. 2001, 114, 6539.
[52] M. A. Blanco, A. M. Pendas, E. Francisco, J. Chem. Theor. Comput. 2005, 1, 1096.
[53] T. L. Fletcher, S. M. Kandathil, P. L. A. Popelier, Theor. Chem. Acc. 2014, 133, 1499:1.
[54] M. J. L. Mills, P. L. A. Popelier, J. Chem. Theory Comput. 2014, 10, 3840.
[55] Y. Yuan, M. J. L. Mills, P. L. A. Popelier, J. Comp. Chem. 2014, 35, 343.
[56] H. A. Boateng, I. T. Todorov, J. Chem. Phys. 2015, 142, 034117.
[57] J. Behler, Phys. Chem. Chem. Phys. 2011, 13, 17930.
[58] M. A. Collins, Theor. Chem. Acc. 2002, 108, 313.
[59] T. J. Hughes, S. Cardamone, P. L. A. Popelier, J. Comp. Chem. 2015, 36, 1844.
[60] B. Brooks, M. Karplus, Proc. Natl. Acad. Sci. 1983, 80, 6571.

Received: 19 June 2015
Revised: 18 August 2015
Accepted: 10 September 2015
Published online on 8 November 2015


Realistic Sampling of Amino Acid Geometries for a Multipolar Polarizable Force Field Timothy J. Hughes,[a,b] Salvatore Cardamone,[a,b] and Paul L. A. Popelier*[a,b]

The Quantum Chemical Topological Force Field (QCTFF) uses the machine learning method kriging to map atomic multipole moments to the coordinates of all atoms in the molecular system. It is important that kriging operates on relevant and realistic training sets of molecular geometries. Therefore, we sampled single amino acid geometries directly from protein crystal structures stored in the Protein Databank (PDB). This sampling enhances the conformational realism (in terms of dihedral angles) of the training geometries. However, these geometries can be fraught with inaccurate bond lengths and valence angles due to artefacts of the refinement process of the X-ray diffraction patterns, combined with experimentally invisible hydrogen atoms. This is why we developed a hybrid PDB/nonstationary normal modes (NM) sampling approach called PDB/NM. This method is superior over standard NM sampling, which captures only geometries optimized from the stationary points of single amino acids in the gas phase. Indeed, PDB/NM combines the sampling of relevant dihedral angles with chemically correct local geometries. Geometries sampled using PDB/NM were used to build kriging models for alanine and lysine, and their prediction accuracy was compared to models built from geometries sampled from three other sampling approaches. Bond length variation, as opposed to variation in dihedral angles, puts pressure on prediction accuracy, potentially lowering it. Hence, the larger coverage of dihedral angles of the PDB/NM method does not deteriorate the predictive accuracy of kriging models, compared to the NM sampling around local energetic minima used so far in the development of QCTFF. © 2015 The Authors. Journal of Computational Chemistry Published by Wiley Periodicals, Inc.

DOI: 10.1002/jcc.24006

1844 Journal of Computational Chemistry 2015, 36, 1844–1857 WWW.CHEMISTRYVIEWS.COM WWW.C-CHEM.ORG FULL PAPER

This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited.

[a] T. J. Hughes, S. Cardamone, P. L. A. Popelier, Manchester Institute of Biotechnology (MIB), 131 Princess Street, Manchester M1 7DN, Great Britain
[b] T. J. Hughes, S. Cardamone, P. L. A. Popelier, School of Chemistry, University of Manchester, Oxford Road, Manchester M13 9PL, Great Britain
Contract grant sponsor: BBSRC and AstraZeneca

Introduction

The rapid but accurate evaluation of potential energy for biomolecular simulation continues to be a challenge. Next generation force fields, which could eventually replace the traditional force fields, continue to be developed. Among the former are AMOEBA,[1] XED,[2] SIBFA,[3] and ACKS2,[4] which all advocate multipolar electrostatics,[5,6] absent in classical architectures.[7,8] The Quantum Chemical Topological Force Field (QCTFF)[9,10] shares this approach to improved electrostatic energy prediction but, on top of this, introduces machine learning to handle electron density fluctuations in response to changes in nuclear configuration. QCTFF aims at capturing the end result of this polarization process rather than the process itself. The machine learning models that QCTFF depends on need to be properly trained with a sufficient number of configurations, but perhaps more importantly, with relevant configurations. The work presented here deals with this problem, and does so in the context of real protein structures.

Machine learning focuses on algorithms that can learn from data, in this case properties (multipole moments and energies) of topological atoms. Machine learning proposes computational methods that generate predictive models that map an output variable to a set of input variables. Models are then built through a training procedure using a set of input values with known output. QCTFF, which continues to be developed in our lab, is an innovative approach to predicting the energy of a molecular system much faster than first principle calculations can. For that purpose, QCTFF captures atomically partitioned first principle information of the system trained for. QCTFF achieves this by relying on a machine learning method called kriging,[11–13] which is increasingly being used[14–19] in the community of force field and potential design.

Traditional force fields approximate energy through bonded and nonbonded contributions that incorporate often loosely defined atom types with their own set of experimentally or computationally obtained parameters. QCTFF operates outside this traditional framework: its architecture does not distinguish between bonded and nonbonded interactions, and atom types do not need to be defined. Instead, QCTFF focuses directly on how atoms interact, allowing for a spectrum of covalency rather than a bonded/nonbonded dichotomy. QCTFF maps atomic properties (the output variables) to molecular coordinates (a set of input variables) using kriging. Therefore, a QCTFF atom will be endowed with a number of kriging models, each describing how an atomic property changes as a function of the coordinates of the molecular system.

To build a QCTFF kriging model, example molecular geometries must be obtained to train the model. QCTFF development targets the simulation of biomolecules, in particular proteins, hence amino acids are molecules of key interest. When sampling amino acid geometries as input for kriging models, the sampled geometries must include all the conformations that one may reasonably expect to occur during the simulation of a protein. Our current paradigm for the sampling of molecular geometries is to use a NM sampling approach. To do this, a small number of stationary points on the potential energy surface of a given molecule of interest are located, and the NM at each stationary point (or local energy minimum) are calculated. Energy is then put randomly into the NM to distort the molecule, and “snapshots” are taken to obtain distorted geometries. The minimum energy conformations of all 20 naturally occurring amino acids have been reported in a comprehensive study,[20] all obtained at the same level of theory. Kriging models built from NM sampled geometries have been used to predict successfully the atomic multipole moments of a range of molecules. These include small organics, amino acids, and hydrogen bonded dimers.[17,21–26] Recently, the electronic kinetic energy of QCT atoms (see Quantum Chemical Topology section) has been successfully incorporated into kriging models for methanol, NMA, glycine and triglycine.[27] Intra-atomic terms such as the (electronic) kinetic energy are not explicitly incorporated in classical force fields but to gain an appreciation of chemical phenomena, such as steric hindrance, intra-atomic terms have been proven important and therefore should be included in QCTFF.[28] Some interesting work quantifies the steric effect, still within QCT, but in the context of experimental[29] electron densities, conceptual DFT,[30] and energy decomposition analysis.[31]

The only other alternative sampling approach investigated draws snapshots from a molecular dynamics simulation, which has been done[32] for liquid water. In the current work, a third sampling method is investigated, one that is pivotal for a realistic sampling of amino acid conformations and one that incorporates experimental information (X-ray structures).

Amino acids are typically described as consisting of two units: a backbone and a side chain. The conformational preference of the backbone unit is dictated by the secondary structure of the proteins and is well understood. The dihedral angles denoted Φ and Ψ describe the backbone using Ramachandran plots. These plots relate the values of Φ and Ψ to a particular secondary structure. Different amino acids display preferences for different regions of the Ramachandran plot, and a thorough investigation of the preferences for all 20 naturally occurring amino acids has been performed before.[33,34] The side chain of an amino acid may exist as a number of different rotamers depending on the side chain dihedrals. Extensive work has been undertaken by other groups to understand the relative populations of the different rotamers occupied by each amino acid, and this has led to a number of rotamer libraries being constructed.[35–40] A rotamer library is a comprehensive guide, drawn from molecular dynamics simulation or protein crystallography, detailing the statistical populations and frequencies of the dihedral angles adopted by amino acid side chains. These libraries may then be used to predict, build, design and solve new protein structures.[41] Torsional energy terms are so important that they receive special attention in force field design, see Ref. [42] for a recent example.

Normal modes sampling has proved successful at sampling conformational space around an input energetic minimum or stationary point. However, one must consider whether the gas phase minimum energy geometries of an amino acid accurately mimic the amino acid structures found in proteins. We note that, in more general terms, the biases induced by datasets that are restricted to stationary or only little deformed structures were also discussed within the context of DFT.[43] It is accepted that amino acids and polypeptides have an intrinsic propensity for specific molecular configurations, and that this preference can differ depending on whether the amino acid exists in a folded protein tertiary structure or a disordered, solvated state.[44] Ramos and coworkers[45] performed ab initio calculations on all 20 natural amino acids using both gas phase and PCM solvation. Of the 323 chemical bonds and 469 angles present, they found mean unsigned errors of less than 0.02 Å and 3° between the PCM and gas phase bonds and angles, respectively. However, the environment of a globular protein is different to that of a hydrated polypeptide due to a number of factors such as intraresidue hydrogen bonding and steric considerations that have an effect on the amino acid conformation.

The work of Jha et al.[46] clearly shows the effect of the environment on the backbone angles Φ and Ψ. They compared the geometric preferences of all 20 amino acids using data from two protein coil libraries: one including residues in structural motifs, and the other only those residues in disordered sections of the proteins. The ratios of structures found in the β-sheet, PPII and α-helical regions were clearly different between the two libraries. To further demonstrate the effect of environment on the structural preferences of amino acids, the distribution of structures obtained from both coil libraries also differed significantly from those obtained experimentally for the central residue of Gly-X-Gly tripeptides (where X is a naturally occurring amino acid).[47,48] It has been shown, both experimentally (using NMR J couplings) and computationally, that disordered amino acid residues favor specific regions of the Ramachandran plot (typically β-sheet and PPII regions) in contrast to the conformational populations found in ordered protein secondary structures.[44,46,49–51] It has also been shown that the side chain rotamer preference of an amino acid is related to the secondary structure of the polypeptide in which it resides,[52] and this relationship between environment and structure has been used successfully in rotamer libraries to predict side chain conformations.[53] In the long term, these results imply that gas phase energy minima of single amino acids used to sample geometries from, are insufficient to sample all important chemically relevant structures.

The efficient sampling of molecular geometries is a challenging problem due to the rapid increase in the available conformational space as molecules grow in size. A systematic search


of conformational space to find low energy structures is impractical and inefficient. A number of efficient approaches have been presented in the literature including the use of molecular dynamics,[54,55] Monte Carlo,[56] transition path sampling,[57–59] and metadynamics.[60] Additionally, fragment based approaches may be used to improve a systematic approach by reducing the number of conformations searched through elimination processes. An example of such an approach is that of Luo and coworkers[61] where, by fragmenting the Gly-Tyr-Gly-Arg tetrapeptide, they reduced 19.6 billion possible candidates for the global minimum conformation down to only 5760.

An alternative to computational sampling approaches for finding important amino acid geometries is to source them from protein crystal structures. Unfortunately, crystal structures cannot be used directly as input into kriging models for several reasons. First, only heavy atoms are detectable by X-ray crystallography and so the hydrogen atom coordinates are dependent upon the refinement process used. Second, removing an amino acid from a crystal structure breaks the peptide bonds at either end of the backbone, which drastically changes the chemical environment and results in incomplete valence of the terminal atoms. Therefore, some post-Protein Databank (PDB)-extraction modifications to the sampled amino acids are required before input to QCTFF. Thirdly and finally, the resolution of the atomic coordinates varies from one crystal structure to another, and sometimes unrealistic bond lengths and angles may be present within a crystal structure.

To address the above concerns, a novel sampling approach is presented here. This approach samples amino acids from the PDB, relaxes bond lengths and valence angles by an ab initio method while preserving the dihedral angles, and then performs nonstationary NM sampling around each sampled amino acid. This approach is termed PDB/NM and the details of both sampling approaches are explained in the following sections.

Background and Methods

Because many of the technical points concerning QCTFF have been described in detail in previous work of our lab, we only give a brief overview of the key concepts here. A comprehensive introduction to kriging and how it features in QCTFF is given in Ref. [19] while Refs. [24,25] provide the most up-to-date detail on the overall training procedure of QCTFF, now called GAIA. Additional descriptions of the machine learning method are also provided in Refs. [17,26].

Quantum chemical topology

Underpinning the development of QCTFF[9] is Quantum Chemical Topology (QCT),[62] which embraces all work[63] in quantum chemistry that uses the topological language of dynamical systems (e.g. attractor, basin, homeomorphism, gradient path, separatrix, critical points). QCT contains the “quantum theory of atoms in molecules”[64–66] as a special case where this topological language is applied to the electron density $\rho$ and its Laplacian. A topological atom $\Omega_A$ is a bundle of gradient paths (i.e., trajectories of steepest ascent through $\rho$), terminating at a maximum critical point, which typically coincides with the nucleus A. Topological atoms are defined in a parameter-free manner, and they are nonoverlapping and sharply bounded (at the inside of the molecule) by so-called interatomic surfaces.

It is a good idea to expand the $1/r_{12}$ expression occurring in the equation for the Coulomb energy between two electron densities. A popular and compact expansion introduces spherical harmonics, which in turn lead to atomic multipole moments. Multipole moments are able to describe the anisotropy of the electron density,[67] in contrast to (isotropic) point charges used by popular force fields such as AMBER[68] and CHARMM.[69] The charge of an atom is the zero-order term of the multipolar expansion, and it is only by including higher-order terms that the anisotropy of the electron density is described. There is considerable evidence, as collected in a recent review,[5] of the advantages of multipolar electrostatics over point charges. QCTFF incorporates multipolar electrostatics, and in the current work it is the atomic multipole moments that are the topological property of interest, that is, they are the output that kriging is tasked to predict.

The Coulomb interaction between two topological atoms $\Omega_A$ and $\Omega_B$ is given by[70]

$$E^{\mathrm{Coul}}_{AB} = \sum_{l_A l_B m_A m_B} Q_{l_A m_A}\, T_{l_A l_B m_A m_B}\, Q_{l_B m_B} \tag{1}$$

where $Q_{l_A m_A}$ is a multipole moment and $T_{l_A l_B m_A m_B}$ is the interaction tensor between two multipole moments. A convenient concept when dealing with the electrostatic interaction between two multipole moments of order $l_A$ and $l_B$ is the interaction rank, $L$, given by:

$$L = l_A + l_B + 1 \tag{2}$$

It has been shown that interaction rank $L = 5$ provides a satisfactory description of the electrostatics acting in a system.[71,72] Note that $L = 5$ requires all atomic multipole moments up to and including hexadecapole (fourth order multipole moments, $\ell = 4$) to be calculated, resulting in 25 multipole moments for each atom.

Atomic properties other than multipole moments may be obtained from QCT. The interacting quantum atoms (IQA)[73] method is a well-developed topological energy decomposition scheme based on the calculation[74] of the exact nonexpanded topological Coulomb energy. IQA decomposes a molecular system into a combination of both intra-atomic (“self”) and interatomic energy terms. Details of the decomposition scheme are beyond the scope of this article but QCTFF is currently incorporating the non-Coulomb terms by the same kriging treatment as the atomic multipole moments in the current work.

The atomic local frame and kriging

QCTFF uses kriging,[11,13,75] also known as Gaussian process regression,[12] which is a method of capturing the changes in atomic multipole moments as a function of molecular geometry. A detailed description is provided in earlier work[25] so only a brief description is provided here. As the coordinates of

an atomic system evolve, for example when bonds stretch and angles bend, the topological properties of the atoms involved will change, for example their atomic charges (or monopole moments). Using kriging, it is possible to build models capable of predicting changes in an atomic property by evaluating the molecular coordinates. In the present work, kriging models are built for the first 25 atomic multipole moments (up to, and including, hexadecapole moment) of each atom in the amino acids alanine (Ala) and lysine (Lys). By treating the atomic multipole moments in this way, both polarization and charge transfer effects are captured.

A chemical system may be defined by a minimum of $3N-6$ internal coordinates. In the language of machine learning, the $3N-6$ coordinates around an atom are referred to as features, and it is these features that a multipole moment is mapped to. In QCTFF an atomic local frame (ALF) is defined to describe the $3N-6$ coordinates around a central atom. Consider a central atom, denoted A. First, the Cahn–Ingold–Prelog rules are used to determine the two atoms of highest priority bonded to A, and these atoms are termed X and Y in order of priority. The distances $R_{AX}$ and $R_{AY}$, and the angle $\theta_{XAY}$ define the three ALF coordinates. Subsequently a right-handed coordinate system is stabilized using the XAY plane. All other atoms in the system can then be described by three polar coordinates, $R_{AK}$, $\phi_{AK}$, and $\theta_{AK}$. One therefore obtains $N-3$ sets of three spherical polar coordinates each, which combined with the aforementioned ALF coordinates make up the $3N-6$ coordinates required, that is, $3(N-3)+3 = 3N-6$.

Returning to kriging, the change in a given multipole moment is smooth with respect to a change in the ALF coordinates. Therefore it is safe to interpolate the atomic multipole moments of an unknown molecular geometry existing inside a set of known geometries. Kriging is used to build models capable of accurate interpolation of the atomic multipole moments by mapping an input (nuclear coordinates) to an output (a multipole moment). To achieve this, a training set of molecular geometries with known atomic multipole moments is required. The sampling of molecular geometries for training kriging models is described below. Kriging models calculate atomic multipole moments of a new geometry by the following process:

$$\hat{y}(\mathbf{x}^*) = \hat{\mu} + \sum_{i=1}^{n} a_i r_i \tag{3}$$

where $\hat{y}(\mathbf{x}^*)$ is a multipole moment at a new set of coordinates $\mathbf{x}^*$ and $\hat{\mu}$ is the global (average) value of the moment. The factor $a_i$ is the $i$th element of the vector $\mathbf{a} = \mathbf{R}^{-1}(\mathbf{y} - \mathbf{1}\hat{\mu})$ and $r_i$ is the $i$th element of $\mathbf{r}$, defined by

$$\mathbf{r} = \left\{ \mathrm{cor}[\varepsilon(\mathbf{x}^*), \varepsilon(\mathbf{x}^1)],\ \mathrm{cor}[\varepsilon(\mathbf{x}^*), \varepsilon(\mathbf{x}^2)],\ \ldots,\ \mathrm{cor}[\varepsilon(\mathbf{x}^*), \varepsilon(\mathbf{x}^n)] \right\}^{T} \tag{4}$$

where $T$ marks the transpose.

Kriging treats all moments as an error from the global value, and it is the correlation of these errors for a given multipole moment between all $n$ training points that is calculated by kriging. This is achieved by building an $n \times n$ correlation matrix $\mathbf{R}$ between all pairs of training points with elements $R_{ij}$, given by

$$R_{ij} = \mathrm{cor}[\varepsilon(\mathbf{x}^i), \varepsilon(\mathbf{x}^j)] = \exp\left[ -\sum_{h=1}^{d} \theta_h \left| x^i_h - x^j_h \right|^{p_h} \right] \tag{5}$$

where $\mathbf{x}^i$ and $\mathbf{x}^j$ are training points composed of $d$ features. The parameters $\theta_h$ ($\theta_h \geq 0$) and $p_h$ ($1 < p_h \leq 2$) describe the importance of each feature $h$ and may be written as the $d$-dimensional vectors $\boldsymbol{\theta}$ and $\mathbf{p}$. A large value of $\theta_h$ corresponds to a feature being highly correlated to the output multipole moment. The parameter $p_h$ describes the smoothness of the function, and is often close to 2.

A second crucial concept underpinning kriging is the so-called concentrated (or reduced) log-likelihood function $\hat{L}$, defined as

$$\hat{L}(\boldsymbol{\theta}, \mathbf{p}) = -\frac{n}{2} \log(\hat{\sigma}^2) - \frac{1}{2} \log(|\mathbf{R}|) \tag{6}$$

where

$$\hat{\sigma}^2 = \frac{(\mathbf{y} - \mathbf{1}\hat{\mu})^{T} \mathbf{R}^{-1} (\mathbf{y} - \mathbf{1}\hat{\mu})}{n} \tag{7}$$

and

$$\hat{\mu} = \frac{\mathbf{1}^{T} \mathbf{R}^{-1} \mathbf{y}}{\mathbf{1}^{T} \mathbf{R}^{-1} \mathbf{1}} \tag{8}$$

where $\mathbf{y}$ is a vector of response values for each training point and $\mathbf{1}$ is a vector of 1s. Another (very different) machine learning method called particle swarm optimization[76] then searches for the optimum values of $\boldsymbol{\theta}$ and $\mathbf{p}$ that maximize the concentrated log-likelihood function.

In the Quantum chemical topology section, it was stated that each atom is described by 25 multipole moments, and therefore there are 25 kriging models associated with each atom. The kriging models are tested on an external test set of geometries, which is strictly not part of the training set. For each test molecule, we predict all the multipole moments of all the atoms in the system, and then calculate all electrostatic interactions between atoms separated by a minimum of three covalent bonds (i.e., 1,n interactions with n > 3). Each predicted interaction energy (between two atoms A and B) is then compared to the original (i.e., not trained) interaction energy obtained from the original (i.e., not kriged) atomic multipole moments. Then the errors of all the aforementioned interactions within one molecular geometry are summed. The absolute value of this summed error (for each test geometry) will be plotted against percentile (i.e., % of test geometries) to obtain a so-called S-curve. Each point on such a curve corresponds to this final absolute error (i.e., $|\Delta E_{\mathrm{system}}|$ in eq. (9)). The S-curve will be described later when one is obtained. The complete description of errors just mentioned is expressed in eq. (9),

$$|\Delta E_{\mathrm{system}}| = \left| E^{\mathrm{original}}_{\mathrm{system}} - E^{\mathrm{predicted}}_{\mathrm{system}} \right| = \left| \sum_{AB} E^{\mathrm{original}}_{AB} - \sum_{AB} E^{\mathrm{predicted}}_{AB} \right| = \left| \sum_{AB} \left( E^{\mathrm{original}}_{AB} - E^{\mathrm{predicted}}_{AB} \right) \right| \tag{9}$$
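The error measure of eq. (9) and the S-curve built from it can be illustrated with a minimal Python sketch. This is not the authors' code: the function names `system_error` and `s_curve` and the numbers in the usage note are hypothetical, and the percentile convention (rank divided by the number of test geometries) is an assumption. Note that, as in eq. (9), the signed pairwise errors are summed before the absolute value is taken, so errors of opposite sign within one geometry can cancel.

```python
def system_error(e_original, e_predicted):
    """|Delta E_system| of eq. (9) for one test geometry.

    Both arguments list the pairwise interaction energies E_AB in the same
    order; signed differences are summed before taking the absolute value.
    """
    return abs(sum(o - p for o, p in zip(e_original, e_predicted)))


def s_curve(errors):
    """Return (error, percentile) pairs sorted by ascending error.

    The percentile of the k-th smallest error (k = 1..n) is 100 * k / n,
    an assumed convention for "% of test geometries".
    """
    ranked = sorted(errors)
    n = len(ranked)
    return [(err, 100.0 * (k + 1) / n) for k, err in enumerate(ranked)]
```

For instance, `s_curve([3.0, 1.0, 2.0, 4.0])` pairs the four errors with the 25th, 50th, 75th and 100th percentiles, so one can read off the error below which a given fraction of test geometries falls.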

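The kriging predictor of eqs. (3)–(5) and (8) can likewise be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the QCTFF implementation: the hyperparameters `theta` and `p` are simply fixed by hand (in the text they are optimized by particle swarm optimization against the concentrated log-likelihood), the linear systems are solved by a small Gaussian elimination routine, and the training data are invented toy values with two features per geometry.

```python
import math


def correlation(xi, xj, theta, p):
    """Eq. (5): exp(-sum_h theta_h * |xi_h - xj_h| ** p_h)."""
    return math.exp(-sum(t * abs(a - b) ** q
                         for a, b, t, q in zip(xi, xj, theta, p)))


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def fit(X, y, theta, p):
    """Return (mu_hat, a) for training features X and responses y."""
    n = len(X)
    R = [[correlation(X[i], X[j], theta, p) for j in range(n)]
         for i in range(n)]
    # Eq. (8): mu_hat = (1^T R^-1 y) / (1^T R^-1 1); R is symmetric.
    mu = sum(solve(R, y)) / sum(solve(R, [1.0] * n))
    # a = R^-1 (y - 1 mu_hat), the weights entering eq. (3).
    a = solve(R, [yi - mu for yi in y])
    return mu, a


def predict(x_new, X, mu, a, theta, p):
    """Eq. (3): y_hat(x*) = mu_hat + sum_i a_i r_i, with r from eq. (4)."""
    r = [correlation(x_new, xi, theta, p) for xi in X]
    return mu + sum(ai * ri for ai, ri in zip(a, r))


# Toy data (invented): two features per geometry, one response each.
X_train = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
y_train = [0.10, 0.35, 0.20, 0.60, 0.30]
theta, p = [1.0, 1.0], [2.0, 2.0]
mu_hat, a = fit(X_train, y_train, theta, p)
```

A useful sanity check on such a sketch is that the predictor interpolates: evaluating `predict` at any training point reproduces the corresponding training response to within numerical precision, because $\mathbf{r}$ then coincides with a column of $\mathbf{R}$.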

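Returning briefly to the interaction rank of eq. (2), the bookkeeping behind the statement that $L = 5$ leads to 25 multipole moments per atom can be checked with a short sketch. The helper names `n_moments` and `rank_pairs` are hypothetical; the only facts used are that a moment of order $l$ has $2l+1$ components and that $L = l_A + l_B + 1$.

```python
def n_moments(l_max):
    """Number of multipole components up to and including order l_max."""
    return sum(2 * l + 1 for l in range(l_max + 1))


def rank_pairs(L_max, l_max):
    """All (l_A, l_B) pairs, each order at most l_max, with rank L <= L_max.

    The interaction rank follows eq. (2): L = l_A + l_B + 1.
    """
    return [(la, lb)
            for la in range(l_max + 1)
            for lb in range(l_max + 1)
            if la + lb + 1 <= L_max]
```

With `l_max = 4` (hexadecapole), `n_moments` gives 25, matching the text. Under the same truncation, `rank_pairs(5, 4)` keeps charge–hexadecapole terms such as `(0, 4)` but discards, for example, dipole–hexadecapole terms, whose rank is 6.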
PDB sampling method

PDB sampling is performed by the in-house (scripting) code MOROS and is used to extract all seed geometries of a particular amino acid from a set of crystal structures. A list of the 260 PDB crystal structure codes sampled from is provided in Part A of the Supporting Information. Hydrogen atoms were added to all protein crystal structures using the HAAD code of Li et al.[77] The HAAD algorithm was developed to add accurately hydrogen atoms by analyzing the positions of nearby heavy atoms, following the basic rules of orbital hybridization and through optimization of steric and electrostatic parameters. HAAD was found to outperform the popular software CHARMM and REDUCE[78] with the RMSD of predicted hydrogen atom positions decreased by 26% and 11%, respectively, when compared to high resolution X-ray and neutron diffraction structures. MOROS returns as output “capped” amino acids meaning that H3CC(=O)– and –N(H)CH3 are appended at the N and C termini of the sampled amino acid, respectively. These atoms are included so that the peptide bonds remain intact, and thereby yield a more realistic representation of an amino acid while present in a protein. The capping groups are built by extracting the atomic coordinates from the residues preceding and following the residue of interest. Figure 1 shows the atoms extracted by MOROS including the amino acid of interest (blue box), and also atoms that make up the caps (red box).

Figure 1. Diagrammatic representation of the atoms extracted by MOROS including the target amino acid (blue box) and also the full set of atoms including those used to make the peptide caps (red box).

In preparation for nonstationary NM treatment, the sampled amino acid geometries are then allowed to partially geometry-relax, that is under the restriction of fixed dihedral angles. This stage is important as it removes some of the outlying bond lengths originally present due to the poor quality crystal structure resolution.

The next step in PDB sampling is to perform a frequency calculation on each amino acid geometry, by first obtaining the Hessian of the potential energy on that point of the surface, for input for the nonstationary NM sampling of the geometry. A choice must be made regarding the number of PDB-sampled amino acid geometries to use as input for nonstationary NM, as this choice influences the number of geometries sampled using NM. This choice is investigated in Results and Discussion section, and unless otherwise stated, 300 random PDB-sampled amino acid geometries are input to the nonstationary NM. The combined PDB and nonstationary NM

weighted Hessian, $\mathbf{H}$, the frequency of each of the $N_{\mathrm{vib}} = 3N - 6$ NM is evaluated. These $N_{\mathrm{vib}}$ NM are orthogonal and form a complete basis within which internal molecular motions can be described. With the mass-weighted force vector, $\mathbf{F}$, a set of $N_{\mathrm{vib}}$ harmonic equations of motion is obtained. These equations of motion allow us to distort the molecular geometries, and perform a sampling of conformational space.

We now discuss the computational means utilized to obtain the various parameters required to evolve the NM. This subsequently permits us to obtain a set of geometries we consider representative of realistic vibrational states of a molecular system. What follows is a brief paraphrase of the excellent explanation given by Ochterski.[79] Beginning with the transformation from the mass-weighted Cartesian coordinates, $\mathbf{q}$, to the set of $N_{\mathrm{vib}}$ internal coordinates, $\mathbf{s}$, we construct the $3N \times 3N$ transformation matrix, $\mathbf{D}$, satisfying

$$\mathbf{s} = \mathbf{D}\mathbf{q} \tag{10}$$

Outlining the construction of $\mathbf{D}$ is beyond the scope of this article. Suffice to say that six orthonormal vectors occupy the first six columns of $\mathbf{D}$, and correspond to the global translational and rotational motions of the system (as given by the Sayvetz conditions). The remaining $N_{\mathrm{vib}}$ vectors are generated by means of a Gram–Schmidt orthonormalization procedure.

The mass-weighted force $\mathbf{F}$ and the mass-weighted Hessian $\mathbf{H}$, both outlined in Part B of the Supporting Information, are transformed into the internal coordinate basis, by use of $\mathbf{D}$

$$\mathbf{F}_s = \mathbf{D}^{T} \mathbf{F}_q \qquad \mathbf{H}_s = \mathbf{D}^{T} \mathbf{H}_q \mathbf{D} \tag{11}$$

where the subscripts denote the basis in which these quantities are expressed and $T$ denotes the transpose. To evaluate the fre-

sampling method will henceforth be referred to as PDB/NM. quencies of the various modes of motion, we diagonalize Hs,

-1 “Normal Modes” sampling E HsE5Ik (12) Typical normal mode analysis is conducted at an energetic where E denote the eigenvectors of Hs and I is the identity minimum (or stationary point) on the molecular potential matrix. The resultant eigenvalues, ðÞIk ii5ki, are related to the energy surface. However, the mathematics leading to NM does mode frequencies, mi,by not restrict their use only at stationary points. A simple gener- rffiffiffiffiffiffiffiffiffiffiffiffi alization of the derivation of the molecular NM enables their k m 5 i 8i51; ...; 3N (13) evaluation at nonstationary points on the potential energy sur- i 4p2c2 face. This derivation is provided in Part B of the Supporting Information. In the following, we present a conformational where c is a factor comprising the speed of light and the con- sampling methodology, which uses these “non-stationary point version between atomic units and cm21. Of course, six of normal modes” as a means for distorting a molecule, that is, these frequencies correspond to the global translational and

sample its configurations. By diagonalization of the mass- rotational degrees of freedom of the system, thus yielding Nvib

nonzero frequencies. The reduced masses and force constants, corresponding to the modes with nonvanishing frequency, are given by similar manipulations of these quantities. The reader is again directed to Ochterski[79] for a discussion of their calculation.

The amplitude of the $i$th mode, $A_i$, is given by rearrangement of the familiar expression for the energy of a simple harmonic oscillator

$$A_i = \sqrt{\frac{2E}{k_i}} \tag{14}$$

where $k_i$ is the force constant of the mode of motion, and $E$ is the energy available to it. We now have all quantities required to evolve the modes of motion and replicate the vibrational dynamics of the system. The total energy available to the system is given by the expression for thermal energy, $E = N_{vib} k_B T/2$, and is stochastically distributed throughout the modes. A temperature of 298 K was used throughout this work. The phase factors of the modes, $\phi$, are also randomly assigned: if $\phi = 0$ for all modes, then they oscillate in unison, which is physically unrealistic. Instead, we assume the modes to resonate out of phase with one another, as energy transfer to each mode from an external heat bath will be strongly decoherent.

Let us note that, for a physically realistic sampling methodology, the average thermal energy available to each mode would comply with a standard equipartition of energy. The energy available to each mode would then be subjected to small stochastic fluctuations. However, one deduces from the above description of our own methodology that we did not follow the route of equipartition. The driving force for this decision was to increase the domain of conformational space that is accessible to our sampling methodology. As explained above, we have chosen to distribute the total thermal energy stochastically through all modes. Given a standard equipartition of thermal energy, the $i$th mode, $q_i$, is limited to the domain $q_i^0 - A_i/2 \le q_i \le q_i^0 + A_i/2$, where $q_i^0$ is the reference state of the mode and $A_i$ is given in eq. (14). However, by stochastically distributing the thermal energy through the modes, the energy available to the $i$th mode, $E_i$, can then take any value in the range $0 \le E_i \le N_{vib} k_B T/2$, as long as the sum of the $E_i$ is $N_{vib} k_B T/2$. In this sense the currently applied methodology is more general than that of equipartition. If $E_i$ takes the value of $k_B T/2$ for all modes, then the sampling domain coincides with the sampling domain of a standard equipartition of energy. However, all other combinations of the $E_i$ have different sampling domains. The sampling domain that is accessible to our stochastic distribution of thermal energy through the modes is then the union of all sampling domains that arise from all possible combinations of the $E_i$. We therefore obtain the largest sampling domain possible for our methodology, which is necessary for the construction of a widely applicable kriging model.

Two issues arise with stochastically distributing the thermal energy through the modes, one methodological and one conceptual. The methodological concern is that there is a non-negligible probability for a significant proportion of the available thermal energy being placed into one mode. If this mode is strongly linked to the motion of a bond length or valence angle, then there is the potential for sampling nonphysical geometries. We have implemented a filtering procedure that prevents the output of such nonphysical geometries. Consider a bond between atoms A and B, of length $\ell_{AB}$, within a seed geometry. If $\ell_{AB}$ exceeds a value of $k_{BOND}$ multiplied by the sum of the atomic covalent radii, $(r_A + r_B)$, then the geometry is considered nonphysical and rejected. Similarly, if $\ell_{AB}$ is lower than the inverse of $k_{BOND}$ multiplied by the sum of the atomic covalent radii, the bond is considered too short and the geometry is rejected. In other words, every bond length must obey the inequality $(1/k_{BOND})(r_A + r_B) \le \ell_{AB} \le k_{BOND}(r_A + r_B)$. Valence angles undergo a similar treatment, so that given any valence angle of the seed geometry, $\alpha_0$, the corresponding valence angle of the sampled geometry, $\alpha$, must obey the inequality $\alpha_0/k_{ANGLE} \le \alpha \le k_{ANGLE}\,\alpha_0$. In the following work, the "stretching" parameters, $k_{BOND}$ and $k_{ANGLE}$, were both set to 1.20. The conceptual concern that we mentioned is that distributing the thermal energy stochastically throughout the modes is nonphysical in terms of equilibrium thermodynamics. For our purposes we are more interested in a sufficiently large sampling domain.

The sole remaining issue is the choice of a dynamical time step with which to evolve the various modes of motion. We ensure that a single oscillation of a mode is sampled uniformly. In other words, for a complete cycle of the $i$th harmonic equation of motion, the time period of the mode is $T_i = 1/\nu_i$. A parameter, $n_{cycle}$, defines the number of points to be evaluated along a single cycle of the harmonic equation of motion. From this, we define the quantity $\Delta t_i = T_i/n_{cycle}$, which is the dynamical time step for the equation of motion. The quantity $n_{cycle}$ is left as a user-defined input, and is set to $n_{cycle} = 10$ from now on. Additionally, the distribution of the total energy throughout the modes is considered a dynamic quantity, and so for every $n_{reset}$ samples that are output, the energy is randomly redistributed throughout the system. The phase factors are also redefined at the same frequency. Again, $n_{reset}$ is left as a user-defined parameter, and is set as $n_{reset} = 2$ in the following. A further justification for the way we sample is given in Part C of the Supporting Information.

Computational details

Sampling of amino acids from the crystal structures was performed by the in-house code MOROS while the in-house FORTRAN code TYCHE distorted the geometries according to NM. The fully automated GAIA code (formerly named AUTOLINE in previous work) was used to build the training and test sets of molecular geometries (Fig. 2). An expanded flow chart of the GAIA procedure is given by Fletcher et al.[24] Once the sampled amino acid geometries were obtained from either PDB/NM or NM, the molecular wave function for each geometry was obtained at the B3LYP/aug-cc-pVDZ level using GAUSSIAN09.[80] The FORTRAN program AIMAll[81] obtained the atomic multipole moments. The parameters briaq = auto and boaq = high, which are standard in GAIA, were used because boaq = high has been seen in the past as a good compromise between accuracy and speed.
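The sampling ingredients described above (stochastic energy distribution, eq. (14), the time step and the bond and angle filters) can be illustrated with a minimal Python sketch. This is not the TYCHE implementation: the function names are hypothetical, SI units are assumed, and the use of normalised uniform random weights is an assumption, since the text only states that the energy is distributed stochastically.

```python
import math
import random

K_B = 1.380649e-23  # Boltzmann constant in J/K


def distribute_energy(n_vib, temperature, rng):
    """Split E = n_vib * k_B * T / 2 at random over n_vib modes.

    Assumption: normalised uniform random weights (the source does not
    specify the distribution). The sum of the shares equals the total.
    """
    total = n_vib * K_B * temperature / 2.0
    # The tiny offset guards against the (improbable) all-zero draw.
    weights = [rng.random() + 1e-12 for _ in range(n_vib)]
    norm = sum(weights)
    return [total * w / norm for w in weights]


def random_phases(n_vib, rng):
    """Assign an independent random phase in [0, 2*pi) to every mode."""
    return [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_vib)]


def amplitude(energy, force_constant):
    """Eq. (14): A_i = sqrt(2 * E_i / k_i)."""
    return math.sqrt(2.0 * energy / force_constant)


def time_step(frequency, n_cycle=10):
    """Delta t_i = T_i / n_cycle, with period T_i = 1 / nu_i."""
    return 1.0 / (frequency * n_cycle)


def bond_is_physical(length, r_a, r_b, k_bond=1.20):
    """Check (r_A + r_B) / k_BOND <= l_AB <= k_BOND * (r_A + r_B)."""
    r_sum = r_a + r_b
    return r_sum / k_bond <= length <= k_bond * r_sum


def angle_is_physical(angle, seed_angle, k_angle=1.20):
    """Check alpha_0 / k_ANGLE <= alpha <= k_ANGLE * alpha_0."""
    return seed_angle / k_angle <= angle <= k_angle * seed_angle
```

In such a scheme the energies and phases would be redrawn every n_reset output samples, and a distorted geometry would be written out only if every bond and valence angle passes the two filters.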


Figure 2. The fully automated GAIA protocol followed to obtain and to test kriging models.

Kriging models were built and then tested using the in-house codes FEREBUS and NYX, respectively. All kriging models were built using Ntrain = 1000 training geometries and were tested on 400 randomly selected geometries from the remaining 1000. Experience has shown that kriging models deteriorate in prediction quality as the standard integration error (i.e., the familiar Lagrangian L of atom Ω or L(Ω)) increases. Hence it is best to set L(Ω) as low as possible, but doing so causes an increasing number of integrations to be discarded. A good compromise is allowing a maximum integration error of L(Ω) = 0.001 a.u. This value was enforced throughout this work, which keeps the number of discarded atoms reasonable but not nil, explaining the surplus of sampled geometries at the outset.

Results and Discussion

Kriging models were built for the two amino acids alanine (Ala) and lysine (Lys) using geometries sampled from four different sampling approaches: PDB_NO_OPT, PDB_OPT, NM and PDB/NM. These four methods are described in Table 1.

Table 1. An overview of the four sampling approaches.

PDB_OPT: Molecular geometries sampled directly from crystal structure coordinates and H atoms added by the HAAD program. GAUSSIAN fully optimizes bond lengths and valence angles but all dihedral angles remain fixed.
PDB_NO_OPT: Molecular geometries taken directly from PDB coordinates and H atoms added by HAAD. Single-point GAUSSIAN calculations without any geometry relaxation.
NM: Standard NM sampling procedure using TYCHE to sample molecular geometries from a number of local energy minima in the gas phase. The local energy minima themselves are not included in either training or test sets.
PDB/NM: 300 randomly selected PDB "seed geometries" sampled with PDB_OPT, each acquiring 7 geometries generated from the nonstationary NM. The "seed geometries" themselves are not included in either training or test sets.

Alanine was chosen because it is the smallest amino acid with a (nontrivial) side chain. Because there is only one side chain dihedral angle (χ1), as opposed to the four dihedral angles (χ1, χ2, χ3, χ4) controlling the side chain of lysine, the φ and ψ angles dominate the dihedral motion of alanine. Lysine has the most flexible side chain of all 20 naturally occurring amino acids, and therefore has been chosen as a rigorous test of the performance of kriging when dealing with highly flexible molecules. Figure 3 shows the four side chain dihedrals in lysine around C–C bonds, or χ1, χ2, χ3, and χ4.

Figure 3. The four dihedral angles in the side chain of Lys, referred to as χ1 (blue), χ2 (red), χ3 (green), and χ4 (purple). [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]

Testing the PDB/NM sampling approach

Kriging models were built for the amino acids Ala and Lys using the four sampling strategies defined in Table 1. Ramachandran plots for the geometries sampled by each of the sampling methods are shown in Figure 4. The dihedral angles are fixed to the same values in both the PDB_OPT and PDB_NO_OPT approaches, which is why Figure 4 assigns the same color (blue) to the distribution of ψ and φ angles of their geometries. As expected, the PDB-sampled Ramachandran plots for both Ala and Lys display a sampling bias toward the α-helix and β-sheet regions with additional clusters of geometries in the left-handed helix region. The green Ramachandran plots display the sampled geometries obtained by the NM method. A number of islands of geometries around the gas-phase energy minima are observed. Several islands are clearly disconnected but some may overlap, such as the long island in lysine (bottom box) at the bottom right of the whole cluster of islands. Because there are regions of conformational space populated by the PDB sampling approaches but not the NM approach, we conclude here that NM sampling from gas phase energy minima is inadequate for building kriging models to be used in biomolecular simulation. This is most noticeable in the case of Lys, where the NM Ramachandran plot appears sparsely populated compared to both the other sampling methods and the Ala NM Ramachandran plot. This is because the side chain of lysine is very flexible, and for each of the nine actual islands in the Ramachandran plot, there are multiple overlapping energy minima with different side chain conformations. This explains why the 39 input minima only


appear as nine islands on the Ramachandran plot. The orange Ramachandran plots, containing the Ala and Lys geometries sampled by the PDB/NM approach, strongly resemble the plots of both PDB_OPT (blue) and PDB_NO_OPT (blue) but with fewer points in regions away from the α-helix and β-sheet regions. This is because the 300 "seed" geometries used as input for the NM sampling were randomly selected from the PDB_OPT sampled geometries and, statistically, they are most likely to be sampled from these well populated α-helix and β-sheet regions. The benefit of PDB/NM (orange) is that, on top of realistic distributions of dihedral angles, bond lengths and angles are more realistic and they are both varied.

Figure 4. Ramachandran plots of Ala (top box) and Lys (bottom box) sampled using PDB_OPT and PDB_NO_OPT (blue), NM (green), and PDB/NM (orange). In the bottom right panel of the top box is a guide to the regions corresponding to the secondary structural motifs, β-sheet (labeled β), α-helix (labeled α), and left-handed α-helix (labeled LH).

Figure 5 shows so-called spider plots of the side chain dihedral angles sampled by each of the sampling approaches. In a spider plot, each of the four axes (meeting at the origin) corresponds to all values that each of the four side chain dihedrals χn (n = 1, 2, 3, or 4) can adopt, that is, from −180° to 180°. Each sampled geometry then corresponds to a quadruplet of dihedral values (χ1, χ2, χ3, χ4), each marked by a point on each of the four corresponding axes. These four points are then linked by four colored lines, which form a (typically lozenge-like) pattern. From the density of these patterns one obtains an instant glimpse of the conformational diversity (or lack thereof) of the side chain geometries.

Clearly, the NM sampling approach (green) samples a very limited range of side chain geometries and does not return the regions of high sampling frequency obtained by the PDB_OPT and PDB_NO_OPT (blue) approaches. For example, the gauche− (−60°) conformation of χ1 is the most sampled conformation in the protein crystal structures but this conformation is not at all present in NM. The preference of χ1 to be in the gauche− conformation in proteins is a well-documented


phenomenon[35] and thus NM sampling's shortcomings are highlighted. The PDB/NM spider plot (orange) shows a better sampling of side chain dihedral angles than that of NM. However, the former shows a sparser sampling of the less populated combinations of dihedral angles compared to PDB_OPT and PDB_NO_OPT (blue).

Figure 5. Spider plots displaying the Lys side chain conformations sampled by each of the four sampling approaches: PDB_OPT and PDB_NO_OPT (blue), NM (green), and PDB/NM (orange). Each axis ranges from −180° to 180°.

Table 2 presents a summary of the relative performance of each sampling approach and the resulting kriging model accuracy for both amino acids. The range in the B3LYP/aug-cc-pVDZ energy of the Ala and Lys geometries sampled by each of the four methods is also included in Table 2. For both amino acids the NM sampled geometries show the smallest range in ab initio energy. This is because the NM sampling method uses the lowest energy gas phase conformations as input minima, and hence all sampled geometries from this method are distortions of these low energy geometries. Therefore, large deviations from the various energy minima cannot occur because the distorted geometries are confined by their respective well. This situation is different to that found in PDB geometries. Here, the lysine geometries sampled by the PDB/NM method have the largest range in ab initio energy, 421 kJ mol⁻¹, which is much larger than found in any other sampling approach. This is expected as the PDB/NM geometries undergo substantial dihedral sampling, as well as bond length and angle distortions caused by the nonstationary NM sampling.

Table 2. Statistical information detailing the sampling of Ala and Lys by the four sampling methods.

                                            PDB_OPT  PDB_NO_OPT  NM      PDB/NM
Alanine
  Range in ab initio energy                 132.5    281.0       84.4    111.0
  Average bond length range[a]              0.02     0.07        0.11    0.12
  Cα–Cβ bond length range                   0.03     0.22        0.14    0.14
  Average |ΔE_system|[b]                    0.7      1.8         4.0     3.4
  Average |E_AB(original) − E_AB(predicted)| 0.1     0.2         0.4     0.4
  Max |ΔE_system|                           6.8      25.8        18.4    17.2
  Max |E_AB(original) − E_AB(predicted)|    10.0     9.4         13.7    9.4
Lysine
  Range in ab initio energy                 126.0    310.6       111.1   420.9
  Average bond length range[a]              0.02     0.08        0.13    0.14
  Cα–Cβ bond length range                   0.05     0.12        0.13    0.13
  Average |ΔE_system|                       1.6      2.5         3.3     3.8
  Average |E_AB(original) − E_AB(predicted)| 0.2     0.3         0.3     0.4
  Max |ΔE_system|                           20.4     23.1        15.2    18.1
  Max |E_AB(original) − E_AB(predicted)|    32.5     34.2        7.1     28.4

All energies are in kJ mol⁻¹ and all distances in Å. [a] The set of training geometries provides a range (i.e., maximum–minimum) for each bond length. The ranges of all bonds appearing in the system are then averaged (over these bonds). [b] The symbols referring to all energetic quantities (except the range) in this table also appear in eq. (9).

Table 2 also lists the average bond length range for all bonded atom pairs in the sampled Ala and Lys geometries, calculated for each sampling method. For both Ala and Lys, PDB_OPT yields the lowest average bond length range, 0.02 Å, due to the relaxation of the bonds to their optimal lengths (and obviously no bond length variation is introduced by NM). The average bond length ranges of 0.07 Å and 0.08 Å for the PDB_NO_OPT Ala and Lys, respectively, are the next lowest values. The reason for the low average bond length range of the PDB_NO_OPT geometries is that the hydrogen addition software used, HAAD, adds hydrogens at a fixed length of 0.985 Å. Therefore, the average range in bond length is reduced by all bonds containing a hydrogen atom. A more informative metric to describe the sampling of bond lengths by each method is the range of a single bond containing two heavy atoms. The bond between Cα and Cβ was chosen for this purpose. Again, PDB_OPT showed the lowest ranges of 0.03 and 0.05 Å, respectively, but the PDB_NO_OPT Ala geometries showed the highest range in Cα–Cβ distance of 0.22 Å, as expected. NM and PDB/NM showed the same range in Cα–Cβ bond length of 0.14 Å. This highlights the similarity of both the stationary and nonstationary NM sampling algorithms in TYCHE.

Kriging models were built for both Ala and Lys using 1000 molecular geometries obtained from each of the four sampling approaches and were tested on 400 previously unseen (i.e., external and not trained for) molecular geometries obtained by the corresponding sampling approach. For example, kriging models built using geometries sampled using the PDB_NO_OPT method were tested on PDB_NO_OPT geometries, PDB/NM kriging models were tested on PDB/NM geometries, etc. Figure 6 shows the S-curves for all four sampling methods. As an example of how to read such an S-curve: 88% of geometries in the external test set for alanine's PDB_NO_OPT kriging models (top, red curve) have an error of maximum 4 kJ mol⁻¹ (or 1 kcal mol⁻¹) (where the red curve intersects the purple dashed line). The more the S-curve is situated at the left of the plot,


the more accurate the model that it describes. The error displayed by an S-curve corresponds to that given by eq. (9), that is, $\left|\sum_{A,B}\left(E_{AB}^{\text{original}} - E_{AB}^{\text{predicted}}\right)\right|$. As such, each point on an S-curve corresponds to the absolute value of the sum of the errors of all predicted Coulombic interactions between pairs of atoms in one test molecular geometry, relative to the original interaction energies. This value is referred to as both the "total absolute error" and also the "S-curve error."

Figure 6. Errors in the predicted total electrostatic interaction energies (1–4 and higher) of alanine (top) and lysine (bottom) for kriging models trained with molecular geometries obtained by: PDB_OPT (blue), PDB_NO_OPT (red), NM (green), and PDB/NM (orange). The dashed purple lines mark the 1 kcal mol⁻¹ threshold.

In connection with the information shown in Figure 6, note that Table 2 also reports the average absolute total error and the highest total error for each S-curve. The alanine models built using PDB_OPT geometries (blue curve) had the lowest average error of 0.7 kJ mol⁻¹. This is attributable to the lack of bond length and angle variation in the training and test sets and so the kriging problem is "less challenging" as there are fewer dimensions of conformational space being sampled. The second left-most S-curve corresponds to the predictions made using the models built using PDB_NO_OPT geometries (red curve). This is most likely a result of the lack of bond length variation of all hydrogen-containing bonds. However, PDB_NO_OPT does have the highest maximum total error of all sampling approaches, amounting to 25.8 kJ mol⁻¹, despite the low average error. This is attributable to an alanine residue extracted from a crystal structure with a significantly stretched Cα–Cβ bond length and an Hα–Cα–Cβ angle of 115°, which is significantly distorted from the stationary value of 108°. This fact illustrates the unsuitability of sampling amino acid geometries directly from crystal structures for QCTFF development, and emphasizes the need for a PDB/NM hybrid sampling approach. The kriging models obtained from the PDB/NM and NM sampled geometries perform worst overall, which is due to the large quantity of bond length sampling relative to the PDB_OPT and PDB_NO_OPT approaches. Despite being the S-curves furthest to the right, PDB/NM and NM have average S-curve errors of only 3.4 and 4.0 kJ mol⁻¹, respectively. More than 60% of the test geometries of alanine were predicted by kriging models with an error of less than 1 kcal mol⁻¹, a value often described as "chemical accuracy." It is interesting to note that the dihedral sampling appears to have less effect on the difficulty of the kriging problem than well-sampled bond lengths. Figure 7 plots the average bond length range against average total (S-curve) error for all four sampling approaches for Ala. The correlation between bond length and average S-curve error ($\frac{1}{N_{train}}\sum_{i=1}^{N_{train}} |\Delta E_{i,\text{system}}|$) is fairly strong, with an R² value of 0.90 (see Fig. 7). To illustrate this point further, the difference in average total error (S-curve error or |ΔE_system|) between PDB/NM and NM is 0.6 kJ mol⁻¹ (see Table 2), although the PDB/NM approach samples a much larger range of dihedral conformational space than NM. In contrast to this, PDB_OPT, which has a much larger sampling of dihedral space than NM but also the smallest average range of bond lengths, has an average total error 3.3 kJ mol⁻¹ lower than that of NM. This observation is a result of the following effect. Under the assumption of an identical dihedral sampling (as is the case for PDB_NO_OPT and PDB_OPT), increasing the range of bond lengths increases the volume of configurational space that the kriging models have to describe. This increase results in a more difficult kriging problem leading to increased prediction errors. It is also observed that changing a bond length has a dominant effect on the multipole moments of the atoms involved. This is illustrated in Supporting Information Figures S1–S3, where plots of Cα charge against both N–Cα bond length and backbone ψ angle are provided for the Ala geometries sampled by the PDB/NM, PDB_OPT and NM approaches, respectively. In both the PDB/NM and NM sampled plots, the Cα charge shows correlation with the N–Cα bond length but not with the ψ angle. It is only in the plots obtained from the PDB_OPT geometries (where the N–Cα bond length range is significantly reduced as a result of partial geometry relaxation) that any correlation between Cα charge and ψ can be seen. In summary, the correlation


patterns above prove the dominance of bond length variation over dihedral sampling in posing a challenge to kriging.

Figure 7. Average bond length deviation against average total (S-curve) error for the different sampling approaches of Ala (left) and Lys (right): PDB_OPT (blue), PDB_NO_OPT (red), NM (green), and PDB/NM (orange). All data taken from Table 2.

The same conclusions may be drawn from the Lys S-curves as from the Ala S-curves: average bond length deviation is the most important factor dictating the average S-curve error (Fig. 7), and although larger dihedral sampling increases the average error, it does this to a lesser extent than a large average bond length deviation. PDB_OPT has the lowest average S-curve error (Lys: 1.6 kJ mol⁻¹ and Ala: 0.7 kJ mol⁻¹) due to the optimized bond lengths having the lowest average deviation (0.02 Å for both Ala and Lys). The PDB/NM S-curve has the highest average error due to having the largest average bond length deviation and also a large dihedral sampling. PDB_NO_OPT has the largest maximum S-curve error but, unlike the high error PDB_NO_OPT point on the Ala S-curve, there is no clear structural reason behind the highest energy geometry. This could indicate that the geometry lies outside of the configurational space of the training set. The overall shape of an S-curve may be related to the quality of the test geometries and the range of conformational space. For example, the NM S-curve (green) is steep with only a small bend at the top. This is a result of the relatively small set of seed geometries causing the sampled geometries to be clustered close together. Therefore all test geometries are close to a training geometry within the kriging model and the errors remain constant throughout. In contrast, the PDB_NO_OPT (red) geometries are not clustered together and therefore the test geometries can be further away from the nearest training set geometry, leading to larger errors. This gives rise

Figure 8. Individual intramolecular interaction prediction errors in Ala against interaction distance obtained for models built using the four sampling approaches: PDB_OPT (blue), PDB_NO_OPT (red), NM (green), and PDB/NM (orange). [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]

1854 Journal of Computational Chemistry 2015, 36, 1844–1857 WWW.CHEMISTRYVIEWS.COM WWW.C-CHEM.ORG FULL PAPER

to the less steep climb of this S-curve and its longer tail toward the 100% ceiling. Each point on the S-curve is a sum of all 1,4 and higher intramolecular interaction prediction errors within a single test geometry ($|\sum_{AB}(E_{AB}^{\mathrm{original}} - E_{AB}^{\mathrm{predicted}})|$ from eq. (9)). Because of the sum, potential cancellation of positive and negative interaction errors is included within the S-curve. To increase the transparency of the results we now focus on the construction of the S-curve. Figure 8 shows all interaction errors for all Ala test geometries plotted against interaction distance for each sampling approach. The maximum absolute interaction error (max $|E_{AB}^{\mathrm{original}} - E_{AB}^{\mathrm{predicted}}|$) and average absolute interaction error (average $|E_{AB}^{\mathrm{original}} - E_{AB}^{\mathrm{predicted}}|$) for each approach are included in Table 2. Supporting Information Figure S4 shows a plot analogous to Figure 8 but for the sampled Lys geometries. The average absolute interaction errors follow the same trend as the total S-curve error (PDB/NM ≈ NM > PDB_NO_OPT > PDB_OPT). For all sampling approaches used, the largest average absolute interaction error was only 0.4 kJ mol⁻¹ (NM and PDB/NM sampled geometries). The correlation between average absolute interaction error and total error is very high, with an R² of 0.97 for Ala and 0.99 for Lys. The plots of the average interaction prediction error versus the total error can be seen in Supporting Information Figure S5.

The standard deviation of the interaction errors for each method is provided in Table 3 for both Ala and Lys. Both PDB_OPT and PDB_NO_OPT have significantly larger standard deviations for Lys (0.5 kJ mol⁻¹ and 0.8 kJ mol⁻¹, respectively) than for Ala (0.2 kJ mol⁻¹ and 0.4 kJ mol⁻¹, respectively), as is expected by comparison of the blue and green plots in Figure 8 and Supporting Information Figure S4. The PDB/NM interactions in Lys also have a larger standard deviation (0.7 kJ mol⁻¹) than the PDB/NM interactions in Ala (0.6 kJ mol⁻¹). Larger standard deviations emerge for Lys because it is a larger, more flexible molecule than Ala and so the kriging problem for PDB sampled geometries is much harder. Thus the kriging model is unable to find as good a solution for Lys as for Ala.

Table 3. Standard deviation of interaction prediction errors (kJ mol⁻¹) for both Ala and Lys from kriging models built from geometries sampled from the four sampling approaches.

Sampling      Ala   Lys
PDB_OPT       0.2   0.5
PDB_NO_OPT    0.4   0.8
NM            0.7   0.5
PDB/NM        0.6   0.7

Optimum ratio of input geometries to sampled geometries for the PDB/NM sampling approach

The hybrid PDB/NM sampling approach has been presented as a means of sampling chemically relevant amino acid geometries for kriging models, taking advantage of the benefits afforded by both PDB and NM sampling whilst avoiding the problems associated with either method. The ratio (denoted 1:n) of PDB-seed geometries (set to 1) to nonstationary NM sampled geometries (set to n) will now be discussed. The maximum dihedral sampling corresponds to a 1:1 ratio of PDB sampled "seed geometries" to NM sampled geometries. However, this ratio is computationally expensive because each PDB-sampled amino acid seed geometry then needs to be partially geometry-relaxed. Conversely, a ratio smaller than 1:1 (i.e., 1:n where n > 1) requires fewer geometry optimizations, but decreases the sampling of (dihedral) conformational space. A smaller number of sampled geometries per PDB-seed geometry will also affect the difficulty of the kriging problem as the sampling of conformational space will increase (assuming a constant training set size).

Training sets have been built, using the PDB/NM sampling approach, for ratios of seed geometries to NM-sampled geometries of 1:20, 1:10, 1:4, 1:2, and 1:1, always with a total of 1200 NM-sampled geometries in each case. These geometries were randomly reshuffled and then kriging models were built using 800 training geometries, and were tested on 400 (external) geometries.
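The bookkeeping behind these training sets can be sketched as follows. This is only an illustration under stated assumptions, not the code used in the study: the function name `build_pdb_nm_sets` and the `(seed_id, sample_id)` labels are hypothetical stand-ins for real geometries, and the NM distortion and partial geometry relaxation steps are not modelled. The sketch shows how a 1:n ratio fixes the number of PDB-seed geometries when the total is held at 1200, followed by the random reshuffle and the 800/400 split.

```python
import random


def build_pdb_nm_sets(n_per_seed, total=1200, n_train=800, seed=0):
    """Return (train, test) lists of hypothetical (seed_id, sample_id) labels.

    n_per_seed is the n in the 1:n ratio of PDB-seed geometries to
    NM-sampled geometries; the total number of sampled geometries is fixed.
    """
    if total % n_per_seed != 0:
        raise ValueError("total must be divisible by n_per_seed")
    n_seeds = total // n_per_seed  # number of PDB-seed geometries required
    geometries = [(s, k) for s in range(n_seeds) for k in range(n_per_seed)]
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    rng.shuffle(geometries)        # random reshuffle before splitting
    return geometries[:n_train], geometries[n_train:]


for n in (20, 10, 4, 2, 1):
    train, test = build_pdb_nm_sets(n)
    print(f"1:{n:<2d} seeds={1200 // n:4d} train={len(train)} test={len(test)}")
```

Under these assumptions the 1:20 ratio needs only 60 seed geometries, whereas the 1:1 ratio needs 1200, which is why the latter requires the most geometry relaxations.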

Figure 9. Errors in the predicted total 1–4 and higher electrostatic interaction energies of lysine by kriging models trained with molecular geometries obtained by the PDB/NM approach with different numbers of PDB-seed geometries (see key on graph, 1200 corresponds to the 1:1 ratio in the main text).

Figure 10. Average total error versus the number of PDB seed geometries for kriging models of lysine obtained from the PDB/NM sampling methodology.
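An S-curve of the kind shown in Figure 9 can be assembled from per-interaction errors as sketched below. This is a minimal illustration, assuming the energies are available as plain Python lists; the function name `s_curve` and the toy numbers are hypothetical and are not data from the paper. Each test geometry contributes one point, the absolute value of its summed interaction errors, and the sorted points are paired with the percentage of test geometries at or below that error.

```python
def s_curve(original, predicted):
    """Build S-curve points from interaction energies (kJ/mol).

    original and predicted each hold one list of interaction energies per
    test geometry. Returns (total_error, percentage) pairs sorted by error.
    """
    totals = sorted(
        abs(sum(o - p for o, p in zip(orig, pred)))  # errors may cancel
        for orig, pred in zip(original, predicted)
    )
    n = len(totals)
    return [(err, 100.0 * (i + 1) / n) for i, err in enumerate(totals)]


# Toy data: four test geometries with two interactions each (hypothetical).
original = [[10.0, -4.0], [3.0, 2.0], [-1.0, 5.0], [0.5, 0.5]]
predicted = [[9.0, -3.0], [2.0, 2.5], [-1.5, 3.0], [0.5, 0.0]]
curve = s_curve(original, predicted)
for err, pct in curve:
    print(f"{err:5.2f} kJ/mol  {pct:5.1f}%")
```

The first toy geometry has interaction errors of +1.0 and -1.0 kJ/mol, so its total error is zero; this is the cancellation of positive and negative interaction errors that the S-curve includes by construction.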

Journal of Computational Chemistry 2015, 36, 1844–1857 1855 FULL PAPER WWW.C-CHEM.ORG

Figure 9 shows the total energy S-curve obtained for each training set. Increasing the number of PDB-seed geometries does not significantly reduce the quality of the kriging model obtained. The average values of the S-curve energies have been plotted against the number of input minima in Figure 10. There is a trend for a larger number of PDB-seed geometries to have a higher average S-curve error, but not dramatically so. The range of errors is only 0.6 kJ mol⁻¹, between a 1:20 ratio of PDB-seed geometries to sampled geometries (average error of 3.8 kJ mol⁻¹) and a 1:1 ratio (average error of 4.4 kJ mol⁻¹).

Conclusions

The topological force field QCTFF contains a machine learning component that handles polarization and charge transfer (in a unified way). The machine learning method used, called kriging, needs a data set of molecular geometries to train on. Here we focus on obtaining a more realistic and relevant training set for amino acids. Before the current study, we sampled the training set by distorting the local energy minima of (peptide-capped) amino acids (in the gas phase) according to the normal modes (NM) obtained at those stationary points. Using the Protein Data Bank (PDB) we show here that these gas phase stationary points miss a number of important amino acid geometries that are present in a folded protein.

We present a new sampling approach that combines sampling of amino acid geometries from the PDB with nonstationary normal mode (NM) distortion. To the best of our knowledge the latter technique has not been attempted before. This hybrid approach is called PDB/NM and is tested on alanine and lysine, the most flexible amino acid of all. The use of the PDB greatly expands the sampling in the space of dihedral angles, both in range and density. Does this expansion lead to worse kriging models, given the larger variation and diversity in dihedral angles? The answer is negative because it turns out that the range in bond lengths is actually the prime factor in determining the difficulty and hence the predictive accuracy of the kriging models. As a result, the new PDB/NM sampling method (which is more "informed") performs as well as the original "gas phase energy minimum" NM sampling. All kriging models lead to very good electrostatic energy prediction errors where more than 60% of external test geometries have a value of less than 4 kJ mol⁻¹. Within the PDB/NM paradigm, the quality of the kriging models is not compromised much even if the training set consists of PDB-sampled geometries only, which corresponds to maximum coverage of conformational space. In summary, the good news is that realistic dihedral angles can safely be combined with realistic bond lengths and angles into a single successful kriging model.

Further work utilizing rotamer libraries to guide the construction of training sets is planned to create training sets that do not depend on the crystal structures sampled from, but still mimic the structures expected in real proteins.

Keywords: quantum theory of atoms in molecules · quantum chemical topology · conformational sampling · kriging · electrostatics · protein data bank

How to cite this article: T. J. Hughes, S. Cardamone, P. L. A. Popelier. J. Comput. Chem. 2015, 36, 1844–1857. DOI: 10.1002/jcc.24006

Additional Supporting Information may be found in the online version of this article.

Received: 5 May 2015
Revised: 19 June 2015
Accepted: 20 June 2015
Published online on 3 August 2015

Theor Chem Acc (2016) 135:195 DOI 10.1007/s00214-016-1951-4

REGULAR ARTICLE

The prediction of topologically partitioned intra‑atomic and inter‑atomic energies by the machine learning method kriging

Peter Maxwell1,2 · Nicodemo di Pasquale1,2 · Salvatore Cardamone1,2 · Paul L. A. Popelier1,2

Received: 24 May 2016 / Accepted: 11 July 2016 / Published online: 27 July 2016 © The Author(s) 2016. This article is published with open access at Springerlink.com

Abstract The construction of a novel protein force field called FFLUX, which uses topological atoms, is founded on high-rank and fully polarizable multipolar electrostatics. The machine learning method kriging successfully predicts multipole moments of a given atom, with the nuclear coordinates of the atoms surrounding this given atom as its only input. Thus, trained kriging models accurately capture the polarizable multipolar electrostatics of amino acids. Here we show that successful kriging models can also be constructed for non-electrostatic energy contributions. As a result, the full potential energy surface of the (molecular) system trained for can be predicted by the corresponding set of atomic kriging models. In particular, we report on the performance of kriging models for each atom's (A) (1) total atomic energy ($E_{\mathrm{IQA}}^{A}$), (2) intra-atomic energy ($E_{\mathrm{intra}}^{A}$) (both kinetic and potential energy), (3) exchange energy ($V_{\mathrm{XC}}^{AA'}$) and (4) electrostatic energy ($V_{\mathrm{cl}}^{AA'}$) of atom A with the rest of the system (A′), and (5) interatomic energy ($V_{\mathrm{inter}}^{AA'}$). The total molecular energy can be reconstructed from the kriging predictions of these atomic energies. For the three case studies investigated (i.e. methanol, N-methylacetamide and peptide-capped glycine), the molecular energies were produced with mean absolute errors under 0.4, 0.8 and 1.1 kJ mol⁻¹, respectively.

Keywords Quantum chemical topology (QCT) · Interacting quantum atoms (IQA) · Quantum theory of atoms in molecules (QTAIM) · Kriging · Machine learning · Amino acids · Force field

Electronic supplementary material The online version of this article (doi:10.1007/s00214-016-1951-4) contains supplementary material, which is available to authorized users.

* Paul L. A. Popelier

1 Manchester Institute of Biotechnology (MIB), 131 Princess Street, Manchester M1 7DN, UK
2 School of Chemistry, University of Manchester, Oxford Road, Manchester M13 9PL, UK

1 Introduction

There is a consensus that traditional force fields, which are instrumental in the vast majority of modern molecular simulations, need further improvement. Their limiting accuracy is regularly pointed out in the literature, as a cause for discrepancies between a given force field's predictions and experimental results. Indeed, if the sampling during the simulation is adequate, then only the potential can be blamed if predictions fail to be trustworthy. In order to improve their energy prediction, popular force fields such as AMBER and CHARMM have been modified on several occasions. Fairly recent modifications were extensively tested [1] in 2011 with the most powerful dedicated molecular simulation hardware in the world. The four force fields tested in this protein folding work were Amber ff03, Amber ff99SB*-ILDN, CHARMM27 and CHARMM22*. It was found that the folding mechanism and the properties of the unfolded state depended substantially on the choice of force field. Another extensive and more recent study [2] concluded, from a millisecond of simulations on intrinsically disordered proteins, that eight well-known force fields generate unexpectedly huge differences in chain dimension, hydrogen bonding and secondary structure content. In fact, discrepancies are so serious that changing the force field has a stronger effect on secondary structure content than changing the entire peptide sequence. Such comparisons are quite rare but precious because they clearly demonstrate, while eliminating any sampling issues or hardware
limitations, that much more work needs to be done. The question is which type of work.

Over the last decade our strategy has been to re-examine and challenge the core architecture of classical force fields. The type of work that accompanies such a bold strategy can be characterized as arduous, and above all, systematic. At the outset of this long term project, the ubiquitous atomic point charges (one on each nucleus) were replaced by nucleus-centred atomic multipole moments. This step is shared by other next-generation force fields such as AMOEBA [3], XED [4], SIBFA [5] and ACKS2 [6] and is driven by a clearly justified desire towards increasingly accurate electrostatics [7, 8].

The current approach embraces so-called topological atoms as the "entity of information" from which any system (molecular or ionic) is built. Much work has been carried out [9–17] in order to obtain a deep understanding of the convergence behaviour and accuracy of the electrostatic interaction between topological atoms, as well as the electrostatic potential they generate. Topological atoms are defined by the quantum theory of atoms in molecules (QTAIM) [18–21] as finite-volume fragments in real 3D space. As quantum atoms [20, 22], topological atoms are deeply rooted in quantum mechanics [23]. These atoms result from a parameter-free partitioning of the electron density, introducing sharp boundaries whose shape responds to any variation in nuclear geometry. The finite size of topological atoms prevents penetration effects and the associated correction in the form of damping functions. Topological multipolar electrostatics proved to be successful in the description of electrostatic interaction in proteins [24].

The next step in the construction of a topological force field is the inclusion of electrostatic polarization. In principle, the multipole moments of any given atom are influenced by all the atoms surrounding it, but this influence typically decreases the further away the surrounding atoms are. In order to be able to handle the full complexity of this influence, we invoked machine learning early on. Initially we used neural networks [25] and applied them to water clusters [26]. In 2009 it turned out [27] that a completely different machine learning method called kriging performed more accurately than neural networks. Although kriging was computationally more expensive, it coped better with the larger number of molecules surrounding the atom of interest. Kriging [28], which is also known as Gaussian regression analysis [29], originates in geostatistics but has been used in very different application areas, including the prediction [30] of atomic properties when inside molecules. The essence of our kriging approach is the establishment of a direct mapping between an atomic multipole moment (output) and the nuclear coordinates (input) of the surrounding atoms. This mapping is obtained after training to a training set of (molecular or cluster) geometries, and is able to make a prediction to a previously unseen geometry of the surrounding molecules. For this purpose we take advantage of the renowned interpolative power of kriging. Moreover, when using kriging models in an extrapolative way, the model returns the average value of the atomic property of interest observed over the training set. Finally, it must be pointed out that a kriging model is not returning an atomic polarizability but the atomic multipole moment itself, after the polarization process is complete. This strategy has an important advantage when the kriging models are used during a molecular dynamics simulation: the atomic moments do not need to be computed (typically iteratively) from the polarizabilities. Instead, the multipole moments are predicted "on the fly", directly from the nuclear coordinates of the surroundings at any given time step.

Most attention has been devoted to modelling the electrostatic interaction at long-range by means of kriged atomic multipole moments. As this procedure is understood and works well [31–37], the next step is how to combine this electrostatic energy with the non-electrostatic energy contributions. Preliminary and unpublished work expressed the latter in the traditional manner, i.e. with Hooke-like potentials reinforced with anharmonic extensions. The parameterization of these potentials, in the presence of kriged electrostatics, turned out to be inadequate. For this reason, a more satisfactory and elegant alternative strategy was carried out, which is to combine kriged electrostatics with kriged non-electrostatics. In this streamlined procedure the machine learning method is trained for energy quantities that are obtained from the same topological energy partitioning [38] that yields the atomic multipole moments. In 2014, the atomic kinetic energy was successfully kriged [39] as the first non-electrostatic energy contribution. That work presented proof-of-concept based on four molecules of increasing complexity (methanol, N-methylacetamide, glycine and triglycine). For all atoms tested, the mean atomic kinetic energy errors fell below 1.5 kJ mol⁻¹, and far below this value in most cases.

In the current article, we go further and deliver proof-of-concept for the kriging of non-electrostatic atomic energy contributions. For that purpose we have adopted the interacting quantum atoms (IQA) scheme proposed by Blanco et al. [40]. This is a topological energy partitioning scheme, inspired by early work [11] on atom–atom partitioning of intramolecular and intermolecular Coulomb energy. In IQA, the kinetic energy is subsumed in the intra-atomic energy (sometimes called "self energy"), which also contains the potential energy of the electrons interacting with themselves and with the nucleus, both within a given atom. This intra-atomic energy plays a pivotal role in stereo-electronic effects, including intermolecular Pauli-like repulsion. Furthermore, IQA can calculate the electrostatic
interaction between two atoms that are so close that their multipolar expansion diverges. The IQA method achieves this goal by using a variant of the six-dimensional volume integration (over two atoms) proposed in Ref. [11], which avoids the multipolar expansion altogether. Similarly, IQA does not multipole-expand the (inter-atomic) exchange energy, although this can be done [15]. However, this route is not followed by our topological force field. This energy contribution expresses covalent bonding energy. Within the Hartree–Fock ansatz, these three energy contributions (self, Coulomb and exchange) complete an IQA partitioning. However, post-Hartree–Fock methods introduce a fourth (non-vanishing) contribution that is associated with electron correlation. In the current work, we use a version of IQA [41] that is compatible with DFT with an eye on including electron correlation effects. We invoke the use of DFT level with the largest of systems studied here, capped glycine.

Three molecules were chosen here for this proof-of-concept investigation: methanol, N-methylacetamide (NMA) and glycine, which is capped by a peptide bond both at its C and N terminus. These systems were chosen to represent a progressive sequence of molecular complexity, while being relevant to biomolecular modelling: methanol features as the side residue in the amino acid serine, NMA is the smallest system modelling a peptide bond, while capped glycine represents an amino acid in an oligopeptide. This work is the first report of combining machine learning with a full topological energy partitioning.

2 Methodological background

2.1 The interacting quantum atoms (IQA) approach

Figure 1 shows examples of topological atoms appearing in N-methylacetamide, which were generated by in-house software [42, 43]. QTAIM essentially defines a topological atom as a three-dimensional subspace determined by the bundle of gradient paths (of a system's electron density) that are attracted to the atom's nucleus. This partitioning idea also features in other topological approaches [44], such as that in connection with the electron localization function (ELF). The topological energy partitioning method IQA is a third approach that uses the central idea of the so-called gradient vector field to extract chemical information from a system. Quantum chemical topology (QCT) [45] is a collective name to gather all topological approaches (10 so far, see Box 8.1 in Ref. [20]) that share the abovementioned central idea. The acronym QCT resurfaces in QCTFF, the force field under construction here [22], which uses topological atoms. The new name for QCTFF is FFLUX, for which a very recent and accessible perspective [46] can be consulted. We also note here that it has been shown before [47] that atom types can be computed using the atomic properties of topological atoms in amino acids.

Fig. 1 Topological atoms in a conformation of N-methylacetamide (NMA)

It is clear from Fig. 1 that QCT partitions a molecule into well-defined non-overlapping atoms [48]. Moreover, these topological atoms do not show any gaps between them; in other words, they partition space exhaustively. It is important to pause and briefly discuss the full consequence of this property. Exhaustive partitioning implies that each point in space belongs to a topological atom: all space is accounted for. In principle, this property must have repercussions [49] for docking studies, as will become clear when QCT starts being used at this larger molecular scale. Classical drug design (e.g. [50]) thinks of both ligand and the protein's active site as bounded by artificial surfaces (e.g. solvent accessible surface) based on standard van der Waals radii and an image of overlapping hard spheres. This view necessarily introduces "gaps of open space", which belong neither to the ligand nor to the protein. However, quantum mechanically we know that electron density resides in those gaps, no matter how small or thin they are. Electron density generates an electrostatic potential, and hence, also generates electrostatic interaction energy contributions. If a gap is not accounted for, then energy will be missing, which interferes with the energy balance during the docking process. However, if there is no gap, as in QCT, then all energy is properly accounted for.

In brief, IQA quantitatively describes the total energy of an atom, even if the system is not at a stationary point in the potential energy surface. In other words, unlike in orthodox QTAIM, there is no need to invoke the atomic virial theorem
[20, 51], the application of which is restricted to stationary points (such as equilibrium geometries). This total energy is composed of the energy associated with the atom itself (intra-atomic), and the energy resulting from the interaction between the atoms (interatomic). We will explain each type of energy in turn, beginning with the decomposition of the molecular energy, $E_{\mathrm{IQA}}^{\mathrm{molec}}$, into the atomic energies, one for each atom A, denoted $E_{\mathrm{IQA}}^{A}$, followed by its breakdown into intra-atomic (or 'self') and interatomic interaction energies,

$$E_{\mathrm{IQA}}^{\mathrm{molec}} = \sum_{A} E_{\mathrm{IQA}}^{A} = \sum_{A} E_{\mathrm{intra}}^{A} + \frac{1}{2} \sum_{A} \sum_{B \neq A} V_{\mathrm{inter}}^{AB} = \sum_{A} \left( E_{\mathrm{intra}}^{A} + \frac{1}{2} \sum_{B \neq A} V_{\mathrm{inter}}^{AB} \right) \tag{1}$$

where $E_{\mathrm{intra}}^{A}$ and $V_{\mathrm{inter}}^{AB}$ are the intra-atomic (of atom A) and inter-atomic (between atoms A and B) energies, respectively. The intra-atomic energy can be further partitioned,

$$E_{\mathrm{intra}}^{A} = T^{A} + V_{\mathrm{ee}}^{AA} + V_{\mathrm{en}}^{AA} \tag{2}$$

where $T^{A}$ is the kinetic energy of the electrons associated with atom A, $V_{\mathrm{ee}}^{AA}$ is the (repulsive) electron–electron potential energy, and $V_{\mathrm{en}}^{AA}$ is the (attractive) electron–nuclear potential energy. Together, these three energies comprise the intra-atomic energy possessed by a single atom.

The interatomic energy attributed to a pair of atoms can also be further partitioned,

$$V_{\mathrm{inter}}^{AB} = \left( V_{\mathrm{nn}}^{AB} + V_{\mathrm{en}}^{AB} + V_{\mathrm{ne}}^{AB} \right) + V_{\mathrm{ee}}^{AB} \tag{3}$$

where $V_{\mathrm{en}}^{AB}$, $V_{\mathrm{ne}}^{AB}$ and $V_{\mathrm{ee}}^{AB}$ were described above but now with the ordering of the subscript components being allied to the ordering of the atoms in the superscript. For example, subscript 'en' and superscript 'AB' refer to the electrons of atom A and the nucleus of atom B. In addition to the electronic energy components, $V_{\mathrm{nn}}^{AB}$ is the repulsive nucleus–nucleus potential energy.

The electron–electron energy $V_{\mathrm{ee}}^{AB}$ can be even further partitioned to give the components in Eq. (4),

$$V_{\mathrm{ee}}^{AB} = V_{\mathrm{Coul}}^{AB} + V_{\mathrm{X}}^{AB} + V_{\mathrm{corr}}^{AB} \tag{4}$$

where $V_{\mathrm{Coul}}^{AB}$ represents the Coulombic interaction between the electrons in atoms A and B, $V_{\mathrm{X}}^{AB}$ represents the inter-electron exchange potential energy and $V_{\mathrm{corr}}^{AB}$ the inter-electron correlation potential energy. Combining the bracketed terms in Eq. (3) with the Coulomb energy only leads to the total electrostatic energy between two atoms, or $V_{\mathrm{elec}}^{AB}$, which is often written as $V_{\mathrm{cl}}^{AB}$ because of the "classical" nature of the electrostatic potential energy (devoid of any purely quantum mechanical exchange energy),

$$V_{\mathrm{cl}}^{AB} = V_{\mathrm{nn}}^{AB} + V_{\mathrm{en}}^{AB} + V_{\mathrm{ne}}^{AB} + V_{\mathrm{Coul}}^{AB} \tag{5}$$

We have extensively studied this energy $V_{\mathrm{cl}}^{AB}$ in terms of its multipolar convergence behaviour. The quantity $V_{\mathrm{cl}}^{AB}$ incorporates the widely reported electrostatic multipole moments' contribution of the long-ranged electrostatic energy, in addition to the short-range electrostatic contribution obtained from IQA. Equation (3) can be rewritten as Eq. (6). This is done by first substituting the bracketed expression by $V_{\mathrm{cl}}^{AB} - V_{\mathrm{Coul}}^{AB}$, as obtained from Eq. (5), and then by substituting $V_{\mathrm{ee}}^{AB}$ using Eq. (4), such that after cancellation of $V_{\mathrm{Coul}}^{AB}$ we obtain,

$$V_{\mathrm{inter}}^{AB} = V_{\mathrm{cl}}^{AB} + V_{\mathrm{X}}^{AB} + V_{\mathrm{corr}}^{AB} = V_{\mathrm{cl}}^{AB} + V_{\mathrm{XC}}^{AB} \tag{6}$$

The new expression separates the interatomic energy into the interplay of ionic-like ($V_{\mathrm{cl}}^{AB}$), covalent ($V_{\mathrm{X}}^{AB}$) and correlation ($V_{\mathrm{corr}}^{AB}$) energies. Note that it is often convenient to combine exchange and correlation in one term. These three energies along with the intra-atomic energy compose the four primary energies that FFLUX is built upon.

Until recently [41] the inclusion of any computationally affordable correlation energy has been lacking because IQA is incompatible with both perturbation theory and density functional theory (DFT) methods. Indeed, the methods that are compatible with the original IQA (i.e. full configuration interaction (FCI), complete active space (CAS), configuration interaction with single and double excitations (CISD) or coupled cluster with single and double excitations (CCSD) levels of theory) demand much greater computational expense. Neither perturbation theory nor standard DFT methods provide a well-defined second-order reduced density matrix, and hence IQA cannot be straightforwardly applied to them. Together with Dr TA Keith, the main author of the QCT computer program AIMAll [52], a DFT-based IQA method that incorporates at least some correlation was validated [41] by our group. The solution involved incorporating the explicit B3LYP atomic exchange functional in order to correctly calculate an atom's total atomic exchange, thereby recovering the ab initio energy of the whole molecule. However, the fact that the functional cannot be used to calculate interatomic exchange (see Ref. [41] for details) led us to calculate the interatomic component using the Hartree–Fock-like expression but then with Kohn–Sham orbitals inserted. The remaining intra-atomic exchange–correlation is then calculated as the difference between the atomic exchange–correlation directly obtained from the B3LYP functional and the Hartree–Fock-like interatomic exchange.

In this investigation, we will again make use of AIMAll. This program is able to return a useful quantity, denoted $V_{\mathrm{inter}}^{AA'}$, which is defined as follows:

$$V_{\mathrm{inter}}^{AA'} = \sum_{B \neq A} V_{\mathrm{inter}}^{AB} \tag{7}$$

where A′ represents every atom other than atom A. Note that Eq. (7) defines the atom-centred interatomic energy contribution $V_{\mathrm{inter}}^{AA'}$. In other words, it summarizes how an atom interacts in full with all other atoms. Note that the quantity $V_{\mathrm{inter}}^{AA'}$ is computationally cheaper to obtain than the sum over the individual pair-wise atomic contributions $V_{\mathrm{inter}}^{AB}$. However, the computational advantage of $V_{\mathrm{inter}}^{AA'}$ is offset by a reduction in the chemical insight that we obtain from being able to inspect each atom pair individually. This loss occurs over and above that caused by lumping together the electrostatic, exchange or correlation energy contributions (see Eq. 6). Instead, the interaction energy is defined in terms of a given atom A experiencing the entire surrounding molecular environment. Equally, $V_{\mathrm{inter}}^{AA'}$ can be decomposed into $V_{\mathrm{cl}}^{AA'}$ and $V_{\mathrm{X}}^{AA'}$ components, much like the pair-wise AB energies. For our current purpose, we are satisfied with this formulation in spite of the reduced chemical insight it gives because our prime motivation is to predict atomic energies, and not to predict chemical insight.

Returning to the IQA formalism, we obtain Eq. (8), which expresses three ways ("approaches") to break up the molecular energy into atomic contributions. Approach A was already present in Eq. (1), while substituting $V_{\mathrm{inter}}^{AA'}$ of Eq. (7) into Eq. (1) leads to Approach B. Finally, Approach C follows from Eq. (6) and now applying the idea behind Eq. (7) to $V_{\mathrm{cl}}^{AB}$ and $V_{\mathrm{XC}}^{AB}$, yielding

$$E_{\mathrm{IQA}}^{\mathrm{molec}} = \underbrace{\sum_{A} E_{\mathrm{IQA}}^{A}}_{\text{Approach A}} = \underbrace{\sum_{A} E_{\mathrm{intra}}^{A} + \frac{1}{2} \sum_{A} V_{\mathrm{inter}}^{AA'}}_{\text{Approach B}} = \underbrace{\sum_{A} E_{\mathrm{intra}}^{A} + \frac{1}{2} \sum_{A} V_{\mathrm{XC}}^{AA'} + \frac{1}{2} \sum_{A} V_{\mathrm{cl}}^{AA'}}_{\text{Approach C}} \tag{8}$$

which is the key equation for the analysis in this paper. Note that $V_{\mathrm{inter}}^{AA'}$ is always halved when used in Eq. (8), attributing half of the energy to a single atom A, in order to prevent double-counting of the interatomic energy in the molecule.

We aim for a greater understanding of both the quan- [...] breakdown. Approach B exposes the separation of the intra-atomic and interatomic energies for insight into how an atom itself experiences the environment it is in. Finally, Approach C takes the separation one step further, using the individual exchange and electrostatic energies in the interatomic description of an atom.

In order to clarify the strategy for the complete treatment of energy contributions in FFLUX a comment about $V_{\mathrm{corr}}^{AB}$ in Eq. (4) is necessary. This energy contribution covers dynamic correlation and hence dispersion. Our preferred route is to treat $V_{\mathrm{corr}}^{AB}$ in exactly the same way as $V_{\mathrm{cl}}^{AB}$ and $V_{\mathrm{X}}^{AB}$. This approach, for which proof-of-concept has been reached in our lab, will guarantee a seamless integration of dispersion in the FFLUX ansatz. This strategy will thereby avoid the typical problems (e.g. the need for damping functions) that alternative dispersion methods introduce.

For a more exhaustive description of the IQA partitioning scheme including additional formulae, its capabilities and previous applications, we refer to the original literature [40, 53–58].

2.2 Kriging (Gaussian regression analysis)

As a machine learning method, kriging has its roots in geostatistics where it has been used to predict the location of precious material after being taught these locations [28]. Within FFLUX, kriging is used to map geometrical change within a molecule, obtained from nuclear coordinates, to a corresponding atomic property, which can be an IQA energy or atomic multipole moment. The atomic property is the machine learning output and the coordinates are the input. Although the full details are given elsewhere [33, 34] we explain here how these coordinates are constructed. It is advantageous that the coordinates are internal in nature (so only 3N − 6 for a nonlinear N-atom system). On each nucleus we install a so-called atomic local frame (ALF), which enables the definition of a polar angle and an azimuthal angle to describe the position of each nucleus in the system, except for the three nuclei required in defining the ALF.
The distance between the ALF’s origin and A A titative nature of these five types of energy (EIQA, Eintra, the nucleus completes the triplet of (spherical polar) coor- AA′ AA′ AA′ Vinter , VXC and Vcl ) and their suitability in FFLUX after dinates for a given nucleus in the system (other than that being kriged. As suggested in Eq. (8), the molecular energy on which the ALF is installed). Machine learning language can be recovered through three different approaches, each calls these coordinates features, as they are the input vari- incorporating the use of different IQA energies. Approach ables to kriging in this case. Finally we note that the first A uses only the total atomic energy of an atom, denoted three features of the vector of 3N 6 features (necessary A A − EIQA. The atomic energy EIQA is a sum of the intra- and to describe unambiguously a molecular geometry) consists inter-atomic energy and hence expresses their result- of (1) the distance between the origin and the first nucleus, ing “trade-off” by the single quantity that it is. This final which fixes the ALF’s x-axis, (2) the distance between the A energy, EIQA, suffices by itself for FFLUX being able to origin and the second nucleus (fixing the ALF’sxy -plane), predict the structure and dynamics of a system, because and (3) the angle suspended by the first nucleus, the origin the latter only depend on the total atomic energy, not its and the second nucleus.
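The feature construction described above can be illustrated with a minimal Python sketch. It is not the FFLUX or FEREBUS implementation: the function name `alf_features`, the choice of which nuclei serve as "first" and "second" (left to the caller here), and the convention that the polar angle is measured from the ALF z-axis and the azimuthal angle from the x-axis within the xy-plane are all assumptions made for illustration.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _norm(a):
    return math.sqrt(_dot(a, a))

def _scale(a, s):
    return tuple(x * s for x in a)

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def alf_features(origin, first, second, others):
    """Return the 3N - 6 features of an N-atom geometry in a local frame.

    origin, first, second: Cartesian positions of the three nuclei that
    define the frame (hypothetical inputs; the nuclei must not be collinear).
    others: positions of the remaining N - 3 nuclei.
    """
    v1 = _sub(first, origin)
    v2 = _sub(second, origin)
    r1, r2 = _norm(v1), _norm(v2)

    # x-axis along the first nucleus; y-axis from Gram-Schmidt on the second
    ex = _scale(v1, 1.0 / r1)
    v2_perp = _sub(v2, _scale(ex, _dot(v2, ex)))
    ey = _scale(v2_perp, 1.0 / _norm(v2_perp))
    ez = _cross(ex, ey)

    # First three features: two distances and the angle first-origin-second
    cos_angle = max(-1.0, min(1.0, _dot(v1, v2) / (r1 * r2)))
    features = [r1, r2, math.acos(cos_angle)]

    # Every other nucleus contributes spherical polar coordinates (r, theta, phi)
    for position in others:
        v = _sub(position, origin)
        x, y, z = _dot(v, ex), _dot(v, ey), _dot(v, ez)
        r = _norm(v)
        theta = math.acos(max(-1.0, min(1.0, z / r)))  # polar angle (assumed from z)
        phi = math.atan2(y, x)                          # azimuthal angle (assumed from x)
        features.extend([r, theta, phi])
    return features

# Toy four-atom geometry (arbitrary units, not a real molecule)
features = alf_features((0.0, 0.0, 0.0), (1.0, 0.0, 0.0),
                        (0.0, 2.0, 0.0), [(1.0, 1.0, 1.0)])
print(len(features))  # 3N - 6 = 6 features for N = 4 nuclei
```

Because all features are distances and angles measured in a frame attached to the molecule, they are unchanged by global translation or rotation, which is why only $3N - 6$ of them are needed.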

1 3 195 Page 6 of 19 Theor Chem Acc (2016) 135:195

Fig. 2 Summary of kriging method at the heart of FFLUX. Atomic energies (intra-atomic, inter-atomic, or total sum) of a given topological atom (right panel, output) are mapped onto the features $\{f_k\}$, which describe the nuclear geometry of the environment surrounding this given atom. The kriging parameters $\theta_k$ and $p_k$ are optimized (see main text)

Before introducing the mathematics involved, it is useful to understand the kriging procedure qualitatively. Before a kriging model can predict a quantity, it must first be trained using a number of molecular geometries with corresponding atomic properties. These data form the training set, while the test set will consist of data that do not belong to the training set. The external character of the test set makes the assessment of the kriging predictions meaningful, because predictions of the training set data would be exact anyway (with the type of kriging used here). The molecular geometries used to build each set are obtained from sampling a molecular energy well that surrounds an energy minimum. The details of this sampling are given below, in Sect. 2.3.

Figure 2 summarizes the kriging approach, showing the key formula that maps the input (left panel, features) to the output (right panel, atomic energies). The feature vector F collects all $N_{\mathrm{feat}}$ features $f_k$, which have been detailed above. Note that $N_{\mathrm{feat}}$ is the dimensionality of the feature space in which all $N_{\mathrm{ex}}$ molecular geometries are expressed. In the formula of Fig. 2, $N_{\mathrm{ex}}$ is the number of training examples, $\mu$ represents the mean of the observed value (also known as a constant “background” value), while $w_i$ is the kriging “weight” (obtained from the so-called correlation matrix [33]). Note that the name weight should not conjure up an image of arbitrary adjustments, because each weight is computed exactly as explained in Ref. [33].

Again, instead of giving the full mathematical details [33, 59] of the kriging procedure, we here explain the crux in words. Imagine a coin being tossed a number of times and the outcomes recorded. If the coin is fair, then the parameter that governs the outcomes, called $t_h$, is exactly 0.5 for tossing a head. This parameter is analogous to the parameters $\theta_k$ and $p_k$ in the formula of Fig. 2. Now suppose that we observe a statistical bias towards the outcome of “heads up”. We can then ask to what extent the coin is biased. In other words, given the observations made, we ask by how much $t_h$ differs from 0.5. In fact, one can find the value of the parameter $t_h$ such that the likelihood of again observing the outcomes that were observed is maximal. For example, one could find that $t_h = 2/3$ is this value. Similarly, when this idea is applied to the kriging problem at hand, the so-called likelihood function is maximized against the observed data, which are the atomic energies. This procedure then returns the optimal values of the parameters $\theta_k$ and $p_k$. The technical details of this optimization are complex and extensively researched in our laboratory [59]. The next paragraph provides a very brief summary. With regard to using the key kriging formula in Fig. 2 (after training), previously unseen features $F = \{f_k\}$ are inserted, returning the atomic property $f(F)$. It is clear that the argument of the exponential is a distance function, which is not necessarily Euclidean (i.e. $p_k \neq 2$).

In terms of the optimization procedure, first the concentrated log-likelihood is calculated analytically. The function is then maximized by a different machine learning method because this cannot be achieved analytically. We have successfully used particle swarm optimization (PSO), the mathematical details of which can be found in Ref. [60]. The optimization of the parameters $\theta_k$ and $p_k$ via PSO is the most computationally expensive step in the overall kriging process. However, optimizing these parameters allows the user to obtain the highest possible concentrated log-likelihood function, ensuring that the best possible model is obtained. The time for the PSO optimization is proportional to the number of geometries in the training set and the number of atoms in the molecule. The result is an analytical formula (see Fig. 2) linking the geometrical features of a molecule and the atomic property of choice. For a more comprehensive description of the kriging protocol, the reader is invited to refer to our previous publications [33, 35–37, 59, 61].

2.3 Sampling of distorted geometries

The selection of training examples with which to construct a kriging model is of great importance. The geometries of the training set should be representative of the physically realistic regions of conformational space. This representation ensures that predictions corresponding to relevant molecular geometries are always made in areas of conformational space that have been trained for in the kriging model. Conventional methods will use some form of

molecular mechanics, parameterized by a classical force field, to generate the training geometries. However, it has been shown that molecular mechanics does not necessarily sample the relevant areas of conformational space over the course of a typical trajectory [62–64].

Our approach attempts to locally approximate the ab initio molecular potential energy surface about a “seeding” conformation, which in this work is the global energetic minimum of the molecule. Whilst more than one seeding conformation can be used, granting a greater exploration of the potential energy surface, we found that for the molecules considered in this work, the Boltzmann weight of the global minimum exceeded 0.75 in all cases. By evaluating the first- and second-order spatial derivatives of the potential energy (Jacobian and Hessian, respectively) at the seeding geometry, one can construct an approximate local potential energy surface through a Taylor expansion. The dynamics on this local approximation to the potential energy surface are then governed by a set of harmonic equations of motion, referred to as the molecular normal modes [65].

Here we outline the major features of our methodology, whilst a more thorough description is given elsewhere [66, 67]. For an N-atom molecular system, we can define a $3N \times 3N$ transformation matrix, D, that converts a mass-weighted Cartesian state vector, q, to an internal coordinate state vector, s. Given D, we can transform the mass-weighted Cartesian Hessian, $H_q$, to an internal coordinate basis through

$D^{T} H_{q} D = H_{s}$ (9)

The frequencies of the molecular normal modes are then given by diagonalising $H_s$,

$E^{-1} H_{s} E = I\lambda$ (10)

where E contains the eigenvectors of $H_s$, I is the identity matrix, and the eigenvalues of $H_s$ are given by the 3N diagonal elements $(I\lambda)_{ii} = \lambda_i$. The ith normal mode frequency, $\nu_i$, is related to the ith eigenvalue through the expression

$\nu_i = \sqrt{\dfrac{\lambda_i}{4\pi^{2}c^{2}}}$ (11)

where c is a conversion factor incorporating the speed of light and a conversion from atomic units to reciprocal centimetres. Six of these normal mode frequencies are equal to zero in an internal coordinate basis, corresponding to the global translational and global rotational degrees of freedom of the molecular system.

The amplitude of vibration of the ith normal mode, $A_i$, is given by the standard expression for a simple harmonic oscillator,

$A_i = \sqrt{\dfrac{2E_i}{k_i}}$ (12)

where $k_i$ is the force constant of the ith normal mode (calculable from $\nu_i$), and $E_i$ is the energy available to it. Each normal mode is allocated an amount of energy given by a standard equipartition, $E_i = k_B T/2$, where $k_B$ is the Boltzmann constant and T is the temperature at which the simulation is performed. To allow for a little more flexibility, each of the $E_i$ is subjected to a stochastic Gaussian fluctuation. Given the amplitude and frequency of the oscillator, each normal mode can be evolved in discrete time.

The final matter requiring discussion is how time is discretized for our equations of motion. Given the frequency of the oscillator, we can compute its time period, $T_i = 1/\nu_i$. We choose a discrete timestep, $t_i$, such that for each time period, $T_i = t_i n_{\mathrm{cycle}}$, where $n_{\mathrm{cycle}}$ is a user-defined parameter. After every $n_{\mathrm{cycle}}$ timesteps, we also perturb the energy available to each normal mode by a new Gaussian-distributed number. To reduce the correlation between samples, a final user-defined parameter, $n_{\mathrm{out}}$, is also defined. We define $n_{\mathrm{out}}$ to correspond to the number of discrete timesteps that we allow to pass before outputting a sample to the training set. For the work conducted here, we set $n_{\mathrm{cycle}} = 10$ and $n_{\mathrm{out}} = 100$.

3 Computational methods

3.1 The GAIA protocol

Three molecules (methanol, NMA and peptide-capped glycine) have been selected as case examples. Initially, the methanol and NMA molecules were generated in GaussView and optimized to a minimum energy geometry using the Gaussian 09 program, at the HF/6-31+G(d,p) level of theory. Single-point energy calculations were performed on the resulting structures, outputting the respective molecular wavefunction and Hessian of the potential energy, calculated at the same level. For the capped glycine molecule, we selected the global minimum conformation from the nine known energetic minima described in a previous publication [68]. For glycine, the calculations were performed at the B3LYP/apc-1 [69] level of theory, in keeping with the level of theory used in previous research [68]. Working at the B3LYP level complements a recent publication [41] that validates the extension of the IQA approach to the B3LYP density functional. Prior to such work, the typical IQA partitioning restrictions demanded a well-defined second-order density matrix, thus ruling out correlation-inclusive and approximate Hamiltonian theories, including the density functional theory (DFT) functionals [40, 70–72].

After obtaining the molecular wavefunctions, the process of sampling, performing the energy partitioning and building the kriging models was achieved using the in-house pipeline software, known as the GAIA protocol. The


GAIA protocol is outlined in Fig. 3, which displays the sequence of steps, flowing left to right, starting with development of sample geometries and terminating with analysing the models.

Fig. 3 GAIA protocol used to develop kriging models for FFLUX

GAIA automates the passing of information between two in-house Fortran 90 programs (TYCHE and FEREBUS, orange boxes in Fig. 3) and two commercially available programs (Gaussian 09 [73] and AIMAll [52], green boxes). The output data from one step subsequently forms the input for the following step, until a seeding geometry (or set of seeding geometries) has been converted into a fully trained kriging model. Each program’s role within GAIA can be summarized in a few lines:

1. TYCHE: distorts an input seed geometry, using the molecular normal modes, to create a broad range of sample geometries that collectively describe a local patch on the molecular potential energy surface (around the seed).
2. Gaussian 09 [73]: performs single-point energy calculations and outputs the wavefunction of each molecule.
3. AIMAll [52] (version 15.09.12): starts from the wavefunction of a molecule to obtain the IQA energy partition values: $E^{A}_{\mathrm{IQA}}$, $E^{A}_{\mathrm{intra}}$, $V^{AA'}_{\mathrm{inter}}$, $V^{AA'}_{\mathrm{XC}}$ and $V^{AA'}_{\mathrm{cl}}$.
4. FEREBUS: uses a training set of molecular geometries to build kriging models of atomic energy (any of the five types above). FEREBUS then validates each model by predicting a test set and comparing the model’s predicted value to the known true value.

The GAIA protocol outlined here is a slight deviation from that reported [22] before for the parameterization procedure of FFLUX. The deviation is a result of the current exclusion of atomic multipole moments and the incorporation of the IQA atomic energy components instead. Thus, Fig. 3 represents the protocol tailored to this investigation only.

A set of 4000 initial samples was generated for each molecule by TYCHE from the distortion of a single energy minimum, at a user-defined temperature of 450 K. After single-point energy calculations and wavefunctions were obtained from Gaussian 09 for every sample, IQA energy contributions were obtained from AIMAll (with default quadrature and integration grid options). We set the AIMAll parameter ‘-encomp’, which refers to the IQA energies to be computed, to the value of 3. As soon as one atom attains a Lagrangian integration error, $L(\Omega)$, greater than the user-defined threshold of 0.001 Hartree, this atom is removed from the training set, as well as all remaining atoms of the molecular geometry in which the offending

atom occurred. This process is known as scrubbing in GAIA. Scrubbing ensures that samples with “noisy” atomic energies (i.e. large $L(\Omega)$ value) are excluded from the development of the model. From the samples remaining in the pool after scrubbing, 500 are set aside as the test set and the remaining samples (rounded to the nearest hundred) are used for the training set. The resulting training sets contained 3400, 3300 and 3000 samples for methanol, NMA and capped glycine, respectively. These training sets were then employed to generate kriging models for each molecule using the in-house program FEREBUS. The kriging parameter, $p_k$, was optimized in the development of all the models. The settings in FEREBUS were as follows: noisy kriging was not requested, tolerance set to $10^{-9}$, convergence to 200 and the swarm-specifier to “dynamic”. Finally, so-called S-curves are produced to illustrate the energy errors of each molecular model. The development and meaning of an S-curve is described in the next section.

3.2 Energy error analysis

As announced earlier, each molecule will be modelled using three approaches, resulting in a tiered level of chemical information being incorporated into the molecular models:

• Approach A: Modelling the molecule using only the total unpartitioned atomic energy, $E^{A}_{\mathrm{IQA}}$.
• Approach B: Modelling the molecule using the intra-atomic ($E^{A}_{\mathrm{intra}}$) and interatomic ($V^{AA'}_{\mathrm{inter}}$) atomic energies.
• Approach C: Modelling the molecule using the intra-atomic energy ($E^{A}_{\mathrm{intra}}$) and the two key interatomic energies: exchange–correlation ($V^{AA'}_{\mathrm{XC}}$) and classical electrostatic ($V^{AA'}_{\mathrm{cl}}$).

Approach A provides the fastest (computationally) and simplest model of a molecule, at the atomic level. Approach B offers a chemically intuitive separation of the intra-atomic and overall interatomic energies and provides models for both. Approach C offers the highest level of chemical detail (in this investigation), separating the interatomic interaction energy into the covalent-like exchange and ionic-like electrostatic components, and again returns models for each.

Moving on to the analysis of the models, we should reintroduce how S-curves can be used to fully convey the quality of a kriging model. The S-curve is a cumulative distribution function (up to 100 %) of absolute energy errors for each test point within the test set. An S-curve plots the absolute (energy) error over the whole molecule (x-axis) versus the test set data point (i.e. molecular geometry) (y-axis), represented as a percentage (100 %/500 data points = 0.2 % per data point). Thus, each test set molecular geometry point corresponds to one point on the S-curve. In order to plot the total molecular energy error (x-axis), the generalized expression appearing in Eq. (13) must be evaluated,

$\Delta E^{\mathrm{Molec}}_{\mathrm{IQA}} = \sum_{A}^{N_{\mathrm{atoms}}} \sum_{Y}^{n \le 3} \left[ E^{A}_{Y,\mathrm{Act}} - E^{A}_{Y,\mathrm{Pred}} \right]$ (13)

where ‘Y’ is a general notation representing any of five possible IQA energy contributions ($E^{A}_{\mathrm{IQA}}$, $E^{A}_{\mathrm{intra}}$, $V^{AA'}_{\mathrm{inter}}$, $V^{AA'}_{\mathrm{XC}}$ and $V^{AA'}_{\mathrm{cl}}$), and n is the number of atomic energies being used to describe an atom (or the total atomic model). In this work n can be one, two or three only (hence the upper limit $n \le 3$ in Eq. 13). The value of n depends on the modelling approach. In particular, for approach A we have $n = 1$ ($E^{A}_{\mathrm{IQA}}$), for approach B $n = 2$ ($E^{A}_{\mathrm{intra}}$ and $V^{AA'}_{\mathrm{inter}}$), and approach C leads to $n = 3$ ($E^{A}_{\mathrm{intra}}$, $V^{AA'}_{\mathrm{XC}}$ and $V^{AA'}_{\mathrm{cl}}$). Before summing over the atoms, counted by index A, a sum over n atomic energies must take place in order to obtain the atomic model. When a model is tested, the predictions of the model can be averaged and compared to the average of the true values. As a result, a model can be said to, on average, slightly over- or under-predict, as determined by a positive or negative difference between the averaged predicted and true values. Therefore, summing over these energy models allows for a cancellation of such errors across the models in two possible ways: (1) across the atomic energies that together constitute a single atom, which is atomic cancellation, and (2) across the atomic models that together constitute the molecule, which is molecular cancellation. The value obtained for $\Delta E^{\mathrm{Molec}}_{\mathrm{IQA}}$, as plotted on the S-curve x-axis, represents the final error for the molecular energy.

The mean absolute error (MAE) for the molecular model is calculated according to Eq. (14). The MAE can be used as a simple measure of the model quality and can be calculated for a single energy model or for a collection of models (such as the resulting molecular model, see Eq. 13),

$E^{\mathrm{Molec}}_{\mathrm{MAE}} = \dfrac{1}{N_{\mathrm{test}}} \sum_{M=1}^{N_{\mathrm{test}}} \left| \Delta E^{\mathrm{Molec}}_{\mathrm{IQA},M} \right|$ (14)

where the sum runs over the $N_{\mathrm{test}} = 500$ molecular geometries of the test set (of methanol, NMA or glycine).

A final measure, the MAE percentage error, MAE%, can also be calculated by dividing the MAE by the size of the energy well range of the test set. This error is given in Eq. (15),

$\mathrm{MAE\%} = \dfrac{E^{\mathrm{Molec}}_{\mathrm{MAE}}}{E^{\mathrm{TestSet}}_{\mathrm{MAX}} - E^{\mathrm{TestSet}}_{\mathrm{MIN}}}$ (15)

where ‘MAX’ refers to the highest molecular energy in the test set, and ‘MIN’ to the lowest. Note that the Electronic

Supplementary Material reports atomic MAE% values, which are defined in Eq. S1 there, by simply replacing $E^{\mathrm{Molec}}_{\mathrm{MAE}}$ by $E^{\mathrm{Atom}}_{\mathrm{MAE}}$. Converting the error into a percentage allows a transferable measure, independent of the energy range of the sampling well, thereby making the MAEs from different molecules more comparable. The MAEs can also be used to compare the quality of individual atomic energy models (also in Eq. 15), which individually may experience a broad variety of energy ranges.

Finally, this work shows, for the first time, S-curves of the complete molecular energy ($\Delta E^{\mathrm{Molec}}_{\mathrm{IQA}}$) rather than only the multipolar electrostatic energy or, later, the kinetic energy. Because multipole moments are not used in this work, all atoms can interact with each other electrostatically (without concerns about possible divergence). In other words, the complete electrostatic interaction is subject to kriging here, for the first time, covering all 1,2; 1,3 and 1,4 interactions.

4 Results

4.1 Methanol

The 4000 samples (i.e. molecular geometries) generated by TYCHE (see Fig. 4) were subject to scrubbing in GAIA, after which 500 samples were set aside as the test set. After rounding down to the nearest hundred, 3400 samples remained and formed the training set. These 3400 training set samples sampled an energy well with an energy range of ~115 kJ mol$^{-1}$.

Fig. 4 Set of 4000 distorted methanol samples as generated from the in-house program TYCHE through sampling of the normal modes at a temperature of 450 K

Figure 5 plots the S-curves for methanol for each of the three modelling approaches (A = $E^{A}_{\mathrm{IQA}}$; B = $E^{A}_{\mathrm{intra}}$ and $V^{AA'}_{\mathrm{inter}}$; C = $E^{A}_{\mathrm{intra}}$, $V^{AA'}_{\mathrm{XC}}$ and $V^{AA'}_{\mathrm{cl}}$).

Fig. 5 Methanol S-curves showing the absolute errors for each of the three modelling approaches, each tested on the same 500 test set samples

It is clear that over 95 % of the test set geometries have $E^{\mathrm{Molec}}_{\mathrm{MAE}}$ energy errors below 1 kJ mol$^{-1}$, across all modelling approaches. Such a low error is a pleasing result and an encouraging start to this analysis. An analysis of each molecular model (i.e. approach) is given in Table 1. Interestingly, the simplest model, approach A, performs the best out of the three approaches with an MAE% error of 0.3 %. Approaches B and C perform very similarly, with an MAE% error of 0.4 % each.

Notably, the maximum absolute error observed for approach B is 2.3 kJ mol$^{-1}$, larger than that for approach C, which returns 1.7 kJ mol$^{-1}$. One would expect that the most chemically insightful approach, which is C, is the one that accumulates the highest error. This presumption follows from the fact that approach C (which has 18 models, or 3 energies for each of the 6 atoms) has 6 additional models compared to approach B (with 12 models, or 2 energies for each of the 6 atoms), and the extra models accrue additional kriging errors with each model. This matter will be discussed in Sect. 5.2.

At the atomic level, Fig. 6 shows the MAEs for each atomic energy type ($E^{A}_{\mathrm{IQA}}$, $E^{A}_{\mathrm{intra}}$, $V^{AA'}_{\mathrm{inter}}$, $V^{AA'}_{\mathrm{XC}}$ and $V^{AA'}_{\mathrm{cl}}$) for each atom in methanol. At first glance, carbon influences the accuracy of the model. In general, C1 has the highest MAEs and is therefore the least accurately modelled atom overall. Following C1, O2 has the next highest errors, followed by the methyl hydrogens (H3/4/5), and finally the most accurately modelled atom, the alcoholic hydrogen H6. In assessing how the IQA atomic energy types compare, the story is also clear. Without exception, the following order appears, starting with the least accurate: $E^{A}_{\mathrm{intra}}$ & $V^{AA'}_{\mathrm{cl}}$ < $V^{AA'}_{\mathrm{inter}}$ < $E^{A}_{\mathrm{IQA}}$ < $V^{AA'}_{\mathrm{XC}}$. The errors for all energies for any atom never exceed 0.8 kJ mol$^{-1}$. The interplay between these energies will be discussed in Sect. 5.

Only looking at the MAE ignores the range of energy that a particular energy has been subjected to in the sampling stage. The MAE percentage error (MAE%) makes the


MAEs more comparable when assessing the difficulty for the kriging engine. The MAE percentage errors, along with the MAEs and energy ranges, are tabulated in Table S1 in the Electronic Supplementary Material. Approach A corresponds to S1 (a), approach B to S1 (b) and approach C to S1 (c). The analysis for the $E^{A}_{\mathrm{intra}}$ model [used in both approaches B and C and given in S1 (b)] is not repeated in S1 (c), as the same model is used for each approach. Figure S1 plots MAE versus atomic energy range in order to observe any correlation between them. Some weak correlation can be seen, but nothing strong enough to validate such a relationship.

Table 1 Quantitative analysis of the methanol models. All energies are given in kJ mol$^{-1}$. MAE% represents the MAE error with respect to the energy range of the test set

Fig. 6 Breakdown of atomic energy errors per atom for methanol. All energies are in kJ mol$^{-1}$

Fig. 7 Set of 4000 distorted NMA samples as generated from TYCHE through sampling of the normal modes at 450 K
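The error measures of Eqs. (13) to (15), on which the MAE and MAE% values reported here rest, can be sketched in a few lines of Python. This is an illustrative sketch only, not code from GAIA or FEREBUS: the function names and the nested-list data layout are hypothetical, the numbers in the example are made up, and the factor of 100 in `mae_percent` is an assumption based on the results being quoted in per cent.

```python
def molecular_error(actual, predicted):
    """Signed molecular error of one test geometry, following Eq. (13).

    actual, predicted: one inner list per atom, holding the n atomic
    energies used by the chosen approach (n = 1, 2 or 3).
    Signed differences are summed, so errors may cancel within an atom
    (atomic cancellation) and between atoms (molecular cancellation).
    """
    return sum(a - p
               for atom_act, atom_pred in zip(actual, predicted)
               for a, p in zip(atom_act, atom_pred))

def mean_absolute_error(molecular_errors):
    """Eq. (14): mean of the absolute molecular errors over the test set."""
    return sum(abs(e) for e in molecular_errors) / len(molecular_errors)

def mae_percent(mae, test_set_energies):
    """Eq. (15): MAE relative to the energy range of the test set.

    The multiplication by 100 is assumed, to express the ratio in per cent.
    """
    energy_range = max(test_set_energies) - min(test_set_energies)
    return 100.0 * mae / energy_range

# Made-up two-atom example in the style of approach B (n = 2 energies per atom)
actual = [[1.0, 2.0], [3.0, 4.0]]
predicted = [[1.5, 1.5], [3.0, 4.2]]
error = molecular_error(actual, predicted)  # -0.5 and +0.5 cancel on the first atom
```

In the toy example the first atom's two models err by equal and opposite amounts, so only the second atom contributes to the molecular error; this is the atomic cancellation referred to in the discussion of Eq. (13).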

4.2 NMA

The NMA models were trained using 3300 training set samples (Fig. 7), sampling an energy well with an energy range of ~84 kJ mol$^{-1}$.

Figure 8 plots the S-curve for each NMA molecular model.

Fig. 8 NMA S-curves showing the absolute errors for each of the three modelling approaches, each tested on the same 500 test set samples

In Fig. 8, almost 95 % of all the test set samples have the $\Delta E^{\mathrm{Molec}}_{\mathrm{IQA}}$ energy correctly predicted within 2.5 kJ mol$^{-1}$ (across all models). This time there is a clearer separation of the S-curve models, with approach A ($E^{A}_{\mathrm{IQA}}$) again being the best modelled of the 3 approaches. Interestingly and unexpectedly, approach C ($E^{A}_{\mathrm{intra}}$, $V^{AA'}_{\mathrm{XC}}$ and $V^{AA'}_{\mathrm{cl}}$) performs better than approach B ($E^{A}_{\mathrm{intra}}$ and $V^{AA'}_{\mathrm{inter}}$) for the NMA system. Given this result, we must now consider whether the dual-cancellation (atomic and molecular) allowed in Eq. (13), prevents


a straightforward correlation between the number of atomic models composing a molecular model and the quality of the molecular model. Glycine will further aid our understanding of this, and the topic will be discussed further in Sect. 5.

The MAE percentage errors for approaches A, B and C are 0.6, 1.3 and 0.9 %, respectively (see Table 2). Again, this is a pleasing result considering the molecular complexity has risen from $(3 \times 6) - 6 = 12$ geometrical features for methanol, to $(3 \times 12) - 6 = 30$ features for NMA. While the number of geometrical features increased by a factor of 2.5, the errors for approaches A and C barely doubled. This favourable behaviour stimulates a further upscaling of features. With the S-curves being less entangled for NMA, the maximum absolute error falls in line with the shape and position of the respective S-curve.

Table 2 Quantitative analysis of the NMA models. All energies are given in kJ mol$^{-1}$. MAE% represents the MAE error with respect to the energy range of the test set

Figure 9 is the counterpart of Fig. 6, this time for NMA. Here, we can further confirm that the atomic energy MAEs appear directly related to both the element and energy type being modelled. Initially, the atoms forming NMA can be immediately separated into their elements for the carbon, oxygen and hydrogen atoms, but the nitrogen atom has very similar errors to the methyl carbons. In NMA, the MAEs also separate the atoms into atom types for carbon (carbonyl carbon and methyl-cap carbons are easily distinguishable). In fact, the oxygen is also modelled so well that it is close to being indistinguishable from the hydrogens. Following this, the same trend in energy prediction accuracy is present in NMA as it was in methanol, that is, the sequence $E^{A}_{\mathrm{intra}}$ & $V^{AA'}_{\mathrm{cl}}$ < $V^{AA'}_{\mathrm{inter}}$ < $E^{A}_{\mathrm{IQA}}$ < $V^{AA'}_{\mathrm{XC}}$ (most accurate) remains valid.

Fig. 9 Breakdown of atomic energy errors per atom for NMA. Energies are in kJ mol$^{-1}$

The MAE percentage errors, MAEs and energy ranges are all reported for each atom in Table S2. Figure S3 once more confirms the weak correlation between MAE and energy range, this time for NMA. Figure S4 is analogous to Fig. 9, but plotting MAE% instead of MAE. In going from MAE to MAE%, the range of the energy is now incorporated. As we can see in Figure S4, the trend previously identified for the MAE ($E^{A}_{\mathrm{intra}}$ & $V^{AA'}_{\mathrm{cl}}$ < $V^{AA'}_{\mathrm{inter}}$ < $E^{A}_{\mathrm{IQA}}$ < $V^{AA'}_{\mathrm{XC}}$) is no longer true, and no clear trend is seen.

4.3 Glycine (Gly)

The capped glycine models were trained using 3000 training set samples (Fig. 10), sampling an energy range of ~163 kJ mol$^{-1}$.

Fig. 10 Set of 4000 distorted capped glycine samples generated from TYCHE through sampling of the normal modes at 450 K

Figure 11 shows the S-curve for glycine.
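An S-curve such as those in Figs. 5, 8 and 11 is simply the sorted list of absolute molecular errors plotted against a cumulative percentage. The following Python sketch shows one way to build it and to read off statements of the form "95 % of test geometries are predicted within a given error". The function names are hypothetical and the four error values are invented for illustration; this is not the plotting code used to produce the figures.

```python
import math

def s_curve(abs_errors):
    """Return (error, cumulative percentage) pairs, sorted by error.

    With 500 test geometries each point advances the curve by
    100 % / 500 = 0.2 %, as described in Sect. 3.2.
    """
    ordered = sorted(abs_errors)
    n = len(ordered)
    return [(err, 100.0 * (i + 1) / n) for i, err in enumerate(ordered)]

def error_at_percentile(abs_errors, percent):
    """Smallest error within which at least `percent` % of geometries fall."""
    ordered = sorted(abs_errors)
    n = len(ordered)
    needed = math.ceil(percent / 100.0 * n)   # number of geometries required
    needed = max(1, min(n, needed))
    return ordered[needed - 1]

# Invented absolute errors (kJ/mol) for four test geometries
curve = s_curve([0.4, 0.1, 0.3, 0.2])
bound = error_at_percentile([0.4, 0.1, 0.3, 0.2], 75)
```

Plotting the first element of each pair on the x-axis and the second on the y-axis gives the characteristic S shape; a curve lying further to the left corresponds to a better model.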


Fig. 11 Capped glycine S-curves showing the absolute errors for each of the three modelling approaches, each tested on the same 500 test set samples

Table 3 Quantitative analysis of the capped glycine models

All energies are given in kJ mol$^{-1}$. MAE% represents the MAE expressed as a percentage of the energy range of the test set.

The inset glycine conformation in Fig. 11 is the global minimum identified by TYCHE as forming ~75 % of the Boltzmann distribution when all 9 energy minima were used as seeds. The predominance of the global minimum determined our choice to sample only around this conformation. Almost 95 % of all test set samples have the $E_{\mathrm{IQA}}^{\mathrm{Molec}}$ energy correctly predicted within ~2.9 kJ mol$^{-1}$ (across all models). The S-curves show a resemblance to those in Fig. 5 but are shifted to a higher prediction error. Here, there is no longer a clear separation of the S-curves for modelling approaches B and C. Approach A ($E_{\mathrm{IQA}}^{A}$) still has the lowest errors of the three approaches. Surprisingly, approaches B ($E_{\mathrm{intra}}^{A}$ and $V_{\mathrm{inter}}^{AA'}$) and C ($E_{\mathrm{intra}}^{A}$, $V_{\mathrm{XC}}^{AA'}$ and $V_{\mathrm{cl}}^{AA'}$) again result in the same MAE of 1.1 kJ mol$^{-1}$. These data are presented in Table 3.

The MAE percentage errors for approaches A, B and C are 0.6, 0.9 and 0.9 %, respectively (see Table 3). Again, this is a very pleasing result considering that the molecular complexity has once more risen, from $(3 \times 12) - 6 = 30$ geometrical features for NMA to $(3 \times 19) - 6 = 51$ features for glycine. In spite of a near doubling of the number of geometrical features, there is little change in the MAEs going from NMA to glycine.

Fig. 12 Breakdown of atomic energy errors per atom for capped glycine. Energies are in kJ mol$^{-1}$

Figure 12 reconfirms the trends identified in Fig. 6 and Fig. 9. Indeed, MAEs are related to the element being modelled and to the IQA energy type, with atom typing appearing even more evident. Within the carbons, three types are present but only two classes are distinguishable: $\mathrm{C}_{\mathrm{C=O}} > \mathrm{C}_{\alpha} \equiv \mathrm{C}_{\mathrm{methyl}}$. Once more, the nitrogen atoms are fairly indistinguishable from the latter class of carbons. This time, the amino hydrogens (H5 and H11) also appear with slightly higher errors than seen for the aliphatic hydrogens. When looking at the trends in the energies themselves, the trends previously seen for methanol and NMA are once more observed in Gly, where $E_{\mathrm{intra}}^{A}$ & $V_{\mathrm{cl}}^{AA'} < V_{\mathrm{inter}}^{AA'} < E_{\mathrm{IQA}}^{A} < V_{\mathrm{XC}}^{AA'}$ (most accurate). A direct comparison of the errors on the atoms present in both NMA and Gly will be given in Sect. 5.

Table S3 and the corresponding plots of Figures S5 and S6 are analogous to Table S2 and Figures S3 and S4, this time for glycine. The same observation of a weak correlation between energy range and MAE for each IQA energy type is made in Figure S5. Figure S6 displays the MAE% values for capped glycine. It is clear that for 13 out of 19 atoms $E_{\mathrm{IQA}}^{A}$ stands out as the least accurate energy to model. For previous systems this majority trend was not seen. However, this conclusion can be rationalized by remembering that $E_{\mathrm{IQA}}^{A}$ is the total atomic energy, and hence influenced by every type of energy change within the atom. Hence, it would be reasonable for it to be the most sensitive when the energy is considered relative to the energy range.

5 Discussion

The discussion is divided into four subsections covering key topics that have either been postulated at the beginning

of the research or have arisen during the analysis of the results. The first two subsections each correspond to an objective.

5.1 Feasibility of modelling IQA energies

The first objective of this investigation is to assess the suitability and quality of modelling the five IQA energies that can be used in the formulation of FFLUX. In short, we have successfully kriged models and made good molecular energy predictions using three possible combinations of the IQA energies, making them all suitable for use in FFLUX. The resulting molecular models have excellent errors, of less than ±0.4, 1.3 and 0.7 kJ mol$^{-1}$ for methanol, NMA and Gly, respectively. Qualitatively speaking, Figs. 5, 8 and 11 display behaviour analogous to S-curves in previous literature [31, 39], leading to an overall successful prediction of both multipolar electrostatic and non-electrostatic energetics.

In order to quantitatively compare the quality of the results, we need to draw on a previous paper [39] where a component of $E_{\mathrm{intra}}^{A}$, namely the kinetic energy $T^{A}$, was kriged for every atom in a similar set of systems (methanol, NMA, glycine and triglycine). We decided to compare our results with the kinetic energy results only for methanol and NMA, because their training set sizes match best. In that work [39], MAEs for the atomic kinetic energy of 0.8 kJ mol$^{-1}$ (0.1 %) and 0.7 kJ mol$^{-1}$ (0.3 %) were obtained for the methanol carbon and the carbonyl carbon in NMA, respectively. In our work, we have presented differing training set sizes (3400 and 3300 for methanol and NMA, respectively), but it is still useful to compare the results. For $E_{\mathrm{intra}}^{A}$ we obtain MAEs of 0.7 kJ mol$^{-1}$ (0.3 %) and 1.5 kJ mol$^{-1}$ (0.4 %), respectively, and for $V_{\mathrm{inter}}^{AA'}$ we have MAEs of 0.5 kJ mol$^{-1}$ (0.3 %) and 1.0 kJ mol$^{-1}$ (0.2 %) for the equivalent atoms, respectively. Hence, despite our training sets being larger, the MAEs remain slightly higher than those observed for the kinetic energy. This confirms an initial suspicion that the summative nature of both $E_{\mathrm{intra}}^{A}$ and $V_{\mathrm{inter}}^{AA'}$ results in a more complicated kriging problem than that posed by one of the subcomponents ($T^{A}$) forming these energies. However, our overall similar performance is still very promising given that we have the complete energy of an atom $A$ (and thus of a molecule when summing over $A$) being modelled with comparable errors, albeit using larger training sets. Another advantage of this investigation is the ability to krige only one, two or three energies, yet still capture the energetic behaviour of the whole molecule. Hence, this design saves substantial computational time by not needing to krige every individual IQA atomic energy ($T^{A}$, $V_{\mathrm{en}}^{AA}$, $V_{\mathrm{ee}}^{AA}$, $V_{\mathrm{en}}^{AA'}$, $V_{\mathrm{ne}}^{AA'}$, $V_{\mathrm{nn}}^{AA'}$ and $V_{\mathrm{ee}}^{AA'}$) (should $V_{\mathrm{ee}}$ still remain unpartitioned). It is also noted that across all atoms in all three systems investigated here, the MAE never exceeded 1.5 kJ mol$^{-1}$ (with the majority under 1 kJ mol$^{-1}$), nor did the MAE percentage error exceed 1.4 %, for any energy.

Another measure of quality with which to loosely compare our results is provided by the previously kriged electrostatic multipole moments, which describe the classical electrostatic interaction energy for 1,4 and higher order interactions [33]. Here, the notation '1,4' describes the interaction between atoms separated by 3 covalent bonds. A 1,5 interaction has 4 separating covalent bonds, and so on. For 1,4 and higher interactions (i.e. 1,n > 4) in a capped histidine system (29 atoms), the MAE for the intramolecular electrostatic energy calculated through kriged multipole moments was 2.5 kJ mol$^{-1}$. In our investigation $V_{\mathrm{cl}}^{AA'}$ and $V_{\mathrm{inter}}^{AA'}$ never exceed an MAE of 1.4 kJ mol$^{-1}$. Is this MAE respectable compared to the multipolar electrostatic energy error? Yes, because $V_{\mathrm{cl}}^{AA'}$ (and $V_{\mathrm{inter}}^{AA'}$ too) accounts for all electrostatic interactions, whereas the multipolar electrostatic analysis accounts only for 1,4 and higher interactions (due to convergence limitations). Admittedly, a training set of only 600 geometries [33] was used for the latter analysis.

Finally, we point out that we are currently investigating a potential reduction in the number of training set samples needed to obtain suitably accurate atomic and molecular models. This research is focussed on the selective building of training sets, and a variety of approaches are currently being investigated to achieve this.

5.2 Cancellation of errors

A second objective of the current investigation is to observe to what extent any cancellation of errors takes place within the summative combination of kriging models described in Eq. 9. As previously described, this approach offers the potential to benefit from the fortuitous cancellation of errors, but equally to suffer from the unfortunate accumulation of errors as a result of the machine learning. In particular, an atomic energy component may be, on average, predicted to be more stable than the true energy (i.e. overestimated). Another atomic energy component may be, on average, predicted to be less stable than the true energy (i.e. underestimated). As a result, the summative combination of an over- and an underestimated result allows for some cancellation, resulting in an overall energy prediction that is more accurate, when only these two energies are considered. Accumulating these cancellations further across many energy models and across many atoms increases this effect dramatically.

Despite the cancellation appearing to rely on chance, since there is no control over the over- or underestimation of the energy models, the prediction of the kriging models will always be consistent, i.e. the same geometrical features are used to map all atomic energies within a single atom,

and each atom's geometrical features are related to every other atom in the molecule. Hence, if one energy is overestimated and another is underestimated with those identical or related features, then this same interplay between the kriging models will always be present. Naturally, the opposite can also occur where, for example, two overestimating models result in a higher summed total error instead of cancelling. However, from the results in Figs. 6, 9 and 12, it is evident that the summative energies ($E_{\mathrm{IQA}}^{A}$ and $V_{\mathrm{inter}}^{AA'}$) generally have lower errors than are obtained by simply summing the absolute errors of the components that make up these energies ($E_{\mathrm{intra}}^{A} + V_{\mathrm{inter}}^{AA'}$ and $V_{\mathrm{XC}}^{AA'} + V_{\mathrm{cl}}^{AA'}$, respectively). Evidence of the molecular cancellation (described in Sect. 3.2) can also be seen, given that atomic MAEs are often between 0.5 and 1.5 kJ mol$^{-1}$, but the resulting molecular MAE is always ≤1 kJ mol$^{-1}$ (for NMA).

5.3 S-curve analysis

The MAEs (and MAE percentage errors) of the total $E_{\mathrm{IQA}}^{\mathrm{Molec}}$ molecular energy, as given in Tables 1, 2 and 3 for each respective molecule, are arguably the most important values obtained in this investigation. Therefore, these values are representative of an overall quality check for this investigation. Through averaging the MAE of the three approaches (A, B and C) for each molecule, we obtain a single MAE for each molecule: 0.3 kJ mol$^{-1}$ (methanol), 0.7 kJ mol$^{-1}$ (NMA) and 0.9 kJ mol$^{-1}$ (Gly). These <1 kJ mol$^{-1}$ results become more impressive when the total energy for each system is compared: methanol has a molecular energy of ~302,000 kJ mol$^{-1}$, NMA ~648,700 kJ mol$^{-1}$ and Gly ~1,191,500 kJ mol$^{-1}$.

Of the three approaches, approach A is consistently the most accurate for each molecule. This is a result of the minimal approach incorporating only a single model per atom in the molecule. However, the performance of approaches B and C was less distinguishable or predictable. Either approach was capable of being slightly more accurate than the other. However, the molecular errors calculated using approaches B and C were always within 0.3 kJ mol$^{-1}$ of one another. In conclusion, we consider all three routes as suitable candidates for modelling the molecular energy, each incorporating a different level of chemical insight.

One further point to note in the analysis of our S-curves is the presence of "rogue points" near the 100 % ceiling of the plots. A few rogue points occur for approach B in methanol, but they are more noticeable for approach B and, in particular, approach A for Gly. These points are considered rogue due to the large gap that separates them from the almost continuous S-like shape of the plot. Figure 13 (on capped glycine) sheds light on how rogue points arise.

Fig. 13 Set of 3000 training samples and 500 test set samples, with randomly assigned sample numbers (x-axis), plotted against their molecular energies (y-axis, kJ mol$^{-1}$) for Gly. The Rogue Test Point 1 (RTP1) is encircled green, RTP2 orange, and the Test Point TP1 black

In Fig. 13, the glycine geometries that passed GAIA's scrubbing procedure are plotted according to their molecular energies and separated according to which set they belong to, that is, the training set in blue or the test set in red. The information shown in Fig. 13 is essentially a one-dimensional distribution of molecular energies (y-axis) but spread out in two dimensions by introducing an x-axis that merely counts the 3000 training set samples and the 500 test set samples. The test set remains identical for all three modelling approaches.

A large vertical gap between the blue points in Fig. 13 indicates a lack of training points in that energy region. On the other hand, continuous lines indicate a high density of points, covering the corresponding energy region well. The point in the test set at −1,198,412.1 kJ mol$^{-1}$ (encircled green in Fig. 13) is the geometry corresponding to the maximum predicted error (8.9 kJ mol$^{-1}$) seen in the S-curve of approach A (utmost right point in the red curve in Fig. 11), denoted RTP1 (Rogue Test Point 1). The point in the test set at −1,198,406.7 kJ mol$^{-1}$ (encircled orange in Fig. 13) is the geometry corresponding to the maximum error (7.9 kJ mol$^{-1}$) seen in the S-curve of approach B (utmost right point in the blue curve in Fig. 11), denoted RTP2. Both these molecular energies appear to be reasonably well sampled in the training set, with nearby (blue) points at −1,198,412.5 and −1,198,406.9 kJ mol$^{-1}$. Unusually, there was no problem for approaches B and C in predicting the molecular energy for RTP1, with errors of 3.4 and 1.3 kJ mol$^{-1}$, respectively. Similarly, for RTP2, errors of 1.1 and 3.1 kJ mol$^{-1}$ were obtained for approaches A and C. This suggests that it is generally not a lack of training geometries that is causing the large rogue errors. Instead it could be any one (or a combination) of the following three effects: (1) the potential energy surface around these points is undulant (in general for the molecule, or for a particular IQA energy) and the training geometries included are

not enough to fully capture this landscape adequately, (2) the test points are outside of the domain of applicability, defined as the region of conformational space that can be interpolated by the training points of the kriging model. In other words, points lying outside of the domain of applicability correspond to points that lie outside of the training set, and so the kriging model is required to perform an extrapolation to make a prediction [74], (3) when summing across models, the balance of accumulating errors results in reduced cancellation of errors. All these reasons would also be supported by working with the higher-energy geometries, where there are fewer samples in the training set compared to those closer to the energy of the seed minimum.

Another measure that is useful when analysing the cause of a poor prediction within FEREBUS is the mean signed error (MSE) (or mean signed deviation, MSD). In statistics, the MSE is a measure of how closely a predicted value matches the true quantity. Having a high MSE for a particular prediction indicates that the model is not well trained in that region, and is a hallmark of working outside the domain of applicability. Taking glycine's approach A as an example, the C12 atom stands out as an atom with a particularly poor $E_{\mathrm{IQA}}^{A}$ prediction for RTP1 (an error of 5.2 kJ mol$^{-1}$). C12 also has an MSE approximately five times the average across all of the test geometries. Some of the other atoms in glycine also indicate a slightly increased difficulty in predicting $E_{\mathrm{IQA}}^{A}$ for this test geometry, but not to the same degree as for C12. Hence, it can be concluded that C12 is the source of the error for RTP1, due to the model operating outside of its domain of applicability. Evidently, as approaches B and C perform well in predicting this molecular energy, this MSE explanation either is irrelevant when using $E_{\mathrm{intra}}^{A}$, $V_{\mathrm{inter}}^{AA'}$, $V_{\mathrm{cl}}^{AA'}$ and/or $V_{\mathrm{XC}}^{AA'}$, or the effect is dampened by the cancellation of errors. This type of analysis can be applied to any rogue point on an S-curve.

In contrast, the point at −1,198,380.1 kJ mol$^{-1}$ (encircled black in Fig. 13) in the test set (TP1), which is not a rogue point, appears to be the least well sampled in the training set, but the prediction errors for this sample are 0.3 kJ mol$^{-1}$ (approach A), 2.4 kJ mol$^{-1}$ (approach B) and 2.0 kJ mol$^{-1}$ (approach C), and thus not near the maximal points on each of the S-curves. The unexpectedly good prediction of TP1 is credited to kriging's impressive interpolation between two widely spaced training points.

5.4 Evidence for atom typing

Figures 6, 9 and 12 illustrated the 'difficulty' of modelling each atomic energy, according to the MAE. From this analysis, we learned that atoms belonging to a particular functional group had a MAE that distinguished some from others, independently of the IQA energies being used for this observation. Across our three systems, the carbonyl carbons were the most difficult atoms to model, with a maximum MAE value of ~1.2–1.3 kJ mol$^{-1}$, in both NMA and Gly. The carbonyl carbon was followed by similar maximum MAE values of ~0.5 kJ mol$^{-1}$ for the α-carbon, the amino nitrogens and the methyl carbons, in methanol, NMA and Gly. The oxygens were easily distinguishable with a much lower maximum MAE of around 0.1 kJ mol$^{-1}$, followed by the consistently very accurately modelled hydrogens with maximal MAEs of <0.1 kJ mol$^{-1}$. It is interesting to note that the N1 atom in NMA resulted in errors similar to those of the corresponding atoms N4 and N10 in Gly. The same is true of the NMA carbonylic atom C2 (compared with C6 and C8 in Gly) and of O3 (compared with O7 and O9 in Gly), within ±0.2 kJ mol$^{-1}$. Hence, with atoms having comparable MAEs across multiple molecules, there is some basic evidence of atom typing. However, it is not a rule that can be used in distinguishing all functional groups present, as evidenced by the difficulty in distinguishing the α-carbon, methyl carbon and amino nitrogen groups using only the MAEs. It would be interesting to see which further trends are observed when a broader range of functional groups is studied.

The discussion above is not the first time atom typing has been considered within the study of an energy partitioning. A recent article by Patrikeev et al. [75] investigated the performance of several density functionals in their evaluation of both Kohn–Sham and correlation kinetic energies of topological atoms, and also commented on discriminating atom types through such atomic descriptors. Initial findings for the Kohn–Sham energies indicated a strong link between some of the tested functionals and the atomic number (or element) of an atom. A further finding in the assessment of correlation kinetic energies allowed aromatic and aliphatic hydrogens to be separated. It should also be reiterated that IQA within DFT is a little tricky since the Kohn–Sham approach does not lead to exact correlated reduced density matrices [76].

6 A note on dispersion and transferability

The only type of energy contribution that is lacking in the current kriging treatment of all IQA energy contributions is that associated with dispersion. Admittedly, the current treatment includes electron correlation, but because we used B3LYP this electron correlation does not cover dispersion effects. However, soon-to-be-published work of our group successfully kriges the IQA intra-atomic, $E_{\mathrm{intra}}^{A}$, and interatomic, $V_{\mathrm{inter}}^{AA'}$, energies at the M06-2X/aug-cc-pVDZ level of theory. This functional describes (or mimics) some mid-range dispersion effects, but the ultimate goal of FFLUX is to invoke a post-Hartree–Fock method (non-DFT) to cover dispersion properly.
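Returning briefly to the error analysis of Sects. 5.2 and 5.3, the difference between accumulating and cancelling errors can be made concrete with a small numerical sketch. It assumes a signed error defined as predicted minus true energy, and the four error values per model are invented for illustration; neither the numbers nor the function names come from this work.

```python
def mean(values):
    return sum(values) / len(values)


def mae(signed_errors):
    """Mean absolute error."""
    return mean([abs(e) for e in signed_errors])


def mse(signed_errors):
    """Mean signed error (the sign reveals a systematic bias)."""
    return mean(signed_errors)


# Hypothetical signed errors (predicted - true, kJ/mol) of two component
# models for the same atom, evaluated on the same four test geometries.
intra_err = [0.8, 0.6, 1.0, 0.4]      # this model errs high on average
inter_err = [-0.5, -0.7, -0.6, -0.2]  # this model errs low on average

# Error of the summed prediction, geometry by geometry.
summed_err = [a + b for a, b in zip(intra_err, inter_err)]

mae_of_sum = mae(summed_err)                   # error after cancellation
sum_of_maes = mae(intra_err) + mae(inter_err)  # worst case, no cancellation
```

Here the two models have mean signed errors of opposite sign (0.7 and −0.5 kJ mol$^{-1}$), so the MAE of the summed prediction (about 0.25 kJ mol$^{-1}$) is far below the sum of the individual MAEs (1.2 kJ mol$^{-1}$). Had both models erred in the same direction, the MAE of the sum would approach that upper bound instead, which is the accumulation scenario described in Sect. 5.2.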


A second note concerns the transferability of the obtained model with respect to an exchange–correlation functional other than B3LYP. The only other exchange–correlation functionals implemented in the program AIMALL are LSDA and M06-2X. The current investigation has not been mirrored using any other functional from which we could directly compare transferability results. However, there is potential for transferability to be considered in the aforementioned work to be published, where M06-2X was used. As in the current work, IQA atomic energy predictions made for multiple different systems could be compared. Some understanding of the transferability to another exchange–correlation functional can come from earlier work [77] from our laboratory in which the electrostatic energy, obtained through atomic multipole moments, was kriged at three levels of theory, namely HF, B3LYP and M06-2X. From that work, one would expect that M06-2X will perform similarly to B3LYP.

7 Conclusion

The development of the novel force field FFLUX now moves beyond its machine learning treatment of multipolar electrostatics. We demonstrate that short-range non-multipolar electrostatics can now also be kriged successfully. Moreover, (non-multipolar) exchange energies as well as intra-atomic energies (beyond just the kinetic energy) are now also kriged with promising energy errors. As a result, chemical bonding and stereo-electronic effects are now, by way of principle, incorporated in FFLUX. This achievement is realized within the context of the methodology of interacting quantum atoms (IQA).

Three approaches (A, B and C), incorporating five IQA atomic energies ($E_{\mathrm{IQA}}^{A}$, $E_{\mathrm{intra}}^{A}$, $V_{\mathrm{inter}}^{AA'}$, $V_{\mathrm{XC}}^{AA'}$ and $V_{\mathrm{cl}}^{AA'}$), were successfully used to develop molecular models offering control in balancing accuracy and chemical insight. The most accurate and least expensive molecular model (approach A) was built using the total atomic energy $E_{\mathrm{IQA}}^{A}$, with MAEs of ±0.3, 0.4 and 0.6 kJ mol$^{-1}$ for methanol, NMA and capped glycine, respectively. Interestingly, the more insightful formalisms involving the intra- and interatomic components (approach B), and the interatomic exchange and electrostatic contributions (approach C), resulted in similar MAEs. These errors are on a par with previous literature and are a result of the combination of models benefitting from cancellation of errors. The latter occurs both within an atom's total energy modelling (atomic cancellation), and also when summing across total atomic models (molecular cancellation) to obtain the molecular model.

In summary, the novel strategy and results were a successful proof-of-concept approach, developed to be integrated into FFLUX. Future work will build upon the method presented here, employing the models in the application of geometry optimization, initially without, but later with, the incorporation of multipolar electrostatics. Future research is also focussing on creating intelligent training sets, designed to reduce the number of samples used in the building of kriging models.

Acknowledgments We acknowledge the EPSRC for funding through the award of an Established Career Fellowship to P.L.A.P. (Grant EP/K005472), and the BBSRC for studentships to S.C. and P.M.

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

References

1. Piana S, Lindorff-Larsen K, Shaw DE (2011) How robust are protein folding simulations with respect to force field parameterization? Biophys J 100:L47–L49
2. Rauscher S, Gapsys V, Gajda MJ, Zweckstetter M, de Groot BL, Grubmüller H (2015) Structural ensembles of intrinsically disordered proteins depend strongly on force field: a comparison to experiment. J Chem Theory Comput 11:5513–5524
3. Ponder JW, Wu C, Pande VS, Chodera JD, Schnieders MJ, Haque I, Mobley DL, Lambrecht DS, DiStasio RAJ, Head-Gordon M, Clark GNI, Johnson ME, Head-Gordon T (2010) Current status of the AMOEBA polarizable force field. J Phys Chem B 114:2549–2564
4. Vinter JG (1994) Extended electron distributions applied to the molecular mechanics of some intermolecular interactions. J Comput Aided Mol Des 8:653–668
5. Gresh N, Cisneros GA, Darden TA, Piquemal J-P (2007) Anisotropic, polarizable molecular mechanics studies of inter- and intramolecular interactions and ligand-macromolecule complexes. A bottom-up strategy. J Chem Theory Comput 3:1960–1986
6. Verstraelen T, Vandenbrande S, Ayers PW (2014) Direct computation of parameters for accurate polarizable force fields. J Chem Phys 141:194114
7. Cardamone S, Hughes TJ, Popelier PLA (2014) Multipolar electrostatics. Phys Chem Chem Phys 16:10367–10387
8. Kramer C, Spinn A, Liedl KR (2014) Charge anisotropy: where atomic multipoles matter most. J Chem Theory Comput 10:4488–4496
9. Kosov DS, Popelier PLA (2000) Convergence of the multipole expansion for electrostatic potentials of finite topological atoms. J Chem Phys 113:3969–3974
10. Popelier PLA, Joubert L, Kosov DS (2001) Convergence of the electrostatic interaction based on topological atoms. J Phys Chem A 105:8254–8261
11. Popelier PLA, Kosov DS (2001) Atom-atom partitioning of intramolecular and intermolecular Coulomb energy. J Chem Phys 114:6539–6547
12. Popelier PLA, Rafat M (2003) The electrostatic potential generated by topological atoms: a continuous multipole method leading to larger convergence regions. Chem Phys Lett 376:148–153


13. Rafat M, Popelier PLA (2005) The electrostatic potential generated by topological atoms. Part II: inverse multipole moments. J Chem Phys 123(204103–204101):204107
14. Rafat M, Popelier PLA (2006) A convergent multipole expansion for 1,3 and 1,4 Coulomb interactions. J Chem Phys 124(144102):1–7
15. Rafat M, Popelier PLA (2007) Topological atom-atom partitioning of molecular exchange energy and its multipolar convergence. In: Matta CF, Boyd RJ (eds) Quantum theory of atoms in molecules, vol 5. Wiley-VCH, Weinheim, pp 121–140
16. Rafat M, Popelier PLA (2007) Long range behaviour of high-rank topological multipole moments. J Comput Chem 28:832–838
17. Joubert L, Popelier PLA (2002) The prediction of energies and geometries of hydrogen bonded DNA base-pairs via a topological electrostatic potential. Phys Chem Chem Phys 4:4353–4359
18. Bader RFW (1990) Atoms in molecules. A quantum theory. Oxford University Press, Oxford
19. Popelier PLA (2000) Atoms in molecules. An introduction. Pearson Education, London
20. Popelier PLA (2014) The quantum theory of atoms in molecules, Chapter 8. In: Frenking G, Shaik S (eds) The nature of the chemical bond revisited. Wiley-VCH, Weinheim, pp 271–308
21. Matta CF, Boyd RJ (2007) The quantum theory of atoms in molecules. From solid state to DNA and drug design. Wiley-VCH, Weinheim
22. Popelier PLA (2015) QCTFF: on the construction of a novel protein force field. Int J Quantum Chem 115:1005–1011
23. Bader RFW, Popelier PLA (1993) Atomic theorems. Int J Quantum Chem 45:189–207
24. Yuan Y, Mills MJL, Popelier PLA (2014) Multipolar electrostatics for proteins: atom–atom electrostatic energies in crambin. J Comput Chem 35:343–359
25. Handley CM, Popelier PLA (2010) Potential energy surfaces fitted by artificial neural networks. J Phys Chem A 114:3371–3383
26. Handley CM, Popelier PLA (2009) A dynamically polarizable water potential based on multipole moments trained by machine learning. J Chem Theory Comput 5:1474–1489
27. Handley CM, Hawe GI, Kell DB, Popelier PLA (2009) Optimal construction of a fast and accurate polarisable water potential based on multipole moments trained by machine learning. Phys Chem Chem Phys 11:6365–6376
28. Cressie N (1993) Statistics for spatial data. Wiley, New York
29. Rasmussen CE, Williams CKI (2006) Gaussian processes for machine learning. The MIT Press, Cambridge
30. Rupp M, Ramakrishnan R, von Lilienfeld OA (2015) Machine learning for quantum mechanical properties of atoms in molecules. J Phys Chem Lett 6:3309–3313
31. Fletcher TL, Davie SJ, Popelier PLA (2014) Prediction of intramolecular polarization of aromatic amino acids using kriging machine learning. J Chem Theory Comput 10:3708–3719
32. Fletcher TL (2014) Polarizable multipolar electrostatics driven by kriging machine learning for a peptide force field: assessment, improvement and up-scaling. Ph.D. thesis, School of Chemistry, University of Manchester
33. Kandathil SM, Fletcher TL, Yuan Y, Knowles J, Popelier PLA (2013) Accuracy and tractability of a kriging model of intramolecular polarizable multipolar electrostatics and its application to histidine. J Comput Chem 34:1850–1861
34. Yuan Y, Mills MJL, Popelier PLA (2014) Multipolar electrostatics based on the kriging machine learning method: an application to serine. J Mol Model 20:2172–2186
35. Mills MJL, Popelier PLA (2012) Polarisable multipolar electrostatics from the machine learning method kriging: an application to alanine. Theor Chem Acc 131:1137–1153
36. Yuan Y (2012) A polarisable multipolar force field for peptides based on kriging: towards application in protein crystallography and enzymatic reactions. Ph.D. thesis, School of Chemistry, University of Manchester
37. Mills MJL, Popelier PLA (2011) Intramolecular polarisable multipolar electrostatics from the machine learning method kriging. Comput Theor Chem 975:42–51
38. Darley MG, Popelier PLA (2008) Role of short-range electrostatics in torsional potentials. J Phys Chem A 112:12954–12965
39. Fletcher TL, Kandathil SM, Popelier PLA (2014) The prediction of atomic kinetic energies from coordinates of surrounding atoms using kriging machine learning. Theor Chem Acc 133(1499):1410–1491
40. Blanco MA, Martin Pendas A, Francisco E (2005) Interacting Quantum Atoms: a correlated energy decomposition scheme based on the quantum theory of atoms in molecules. J Chem Theory Comput 1:1096–1109
41. Maxwell P, Martin Pendas A, Popelier PLA (2016) Extension of the interacting quantum atoms (IQA) approach to B3LYP level density functional theory. Phys Chem Chem Phys [Epub ahead of print]
42. Rafat M, Devereux M, Popelier PLA (2005) Rendering of quantum topological atoms and bonds. J Mol Graph Model 24:111–120
43. Rafat M, Popelier PLA (2007) Visualisation and integration of quantum topological atoms by spatial discretisation into finite elements. J Comput Chem 28:2602–2617
44. Silvi B, Savin A (1994) Classification of chemical bonds based on topological analysis of electron localization functions. Nature (London) 371:683–686
45. Popelier PLA (2005) Quantum chemical topology: on bonds and potentials. In: Wales DJ (ed) Structure and bonding. Intermolecular forces and clusters, vol 115. Springer, Heidelberg, pp 1–56
46. Popelier PLA (2016) Molecular simulation by knowledgeable quantum atoms. Phys Scr 91:033007
47. Popelier PLA, Aicken FM (2003) Atomic properties of selected biomolecules: quantum topological atom types of carbon occurring in natural amino acids and derived molecules. J Am Chem Soc 125:1284–1292
48. Popelier PLA (2016) Quantum chemical topology. In: Mingos M (ed) The chemical bond: 100 years old and getting stronger. Springer, Cham, pp 71–117
49. Popelier PLA (2012) New insights in atom–atom interactions for future drug design. Curr Top Med Chem 12:1924–1934
50. Schneider N, Lange G, Hindle S, Klein R, Rarey M (2013) A consistent description of HYdrogen bond and DEhydration energies in protein–ligand complexes: methods behind the HYDE scoring function. J Comput Aided Mol Des 27:15–29
51. Bader RFW, Beddall PM (1972) Virial field relationship for molecular charge distributions and the spatial partitioning of molecular properties. J Chem Phys 56:3320–3329
52. Todd A Keith, AIMAll (Version 15.09.12) (2015) TK Gristmill Software, Overland Park KS, USA. http://aim.tkgristmill.com
53. Chávez-Calvillo R, García-Revilla M, Francisco E, Martín-Pendás A, Rocha-Rinza T (2015) Dynamical correlation within the interacting quantum atoms method through coupled cluster theory. Comput Theor Chem 1053:90–95
54. Eskandari K, Van Alsenoy C (2014) Hydrogen–hydrogen interaction in planar biphenyl: a theoretical study based on the interacting quantum atoms and Hirshfeld atomic energy partitioning methods. J Comput Chem 35:1883–1889
55. Dillen J (2013) Congested molecules. Where is the steric repulsion? An analysis of the electron density by the method of interacting quantum atoms. Int J Quantum Chem 113:2143–2153


56. Martin Pendas A, Blanco MA, Francisco E (2006) The nature of the hydrogen bond: a synthesis from the interacting quantum atoms picture. J Chem Phys 125:184112
57. Martin Pendas A, Francisco E, Blanco MA (2006) Binding energies of first row diatomics in the light of the interacting quantum atoms approach. J Phys Chem A 110:12864–12869
58. Inostroza-Rivera R, Yahia-Ouahmed M, Tognetti V, Joubert L, Herrera B, Toro-Labbe A (2015) Atomic decomposition of conceptual DFT descriptors: application to proton transfer reactions. Phys Chem Chem Phys 17:17797–17808
59. Di Pasquale N, Davie SJ, Popelier PLA (2016) Optimization algorithms in optimal predictions of atomistic properties by kriging. J Chem Theor Comput 12:1499–1513
60. Kennedy J, Eberhart RC (1995) Particle swarm optimization. In: Proceedings of the IEEE international conference on neural networks, vol 4, pp 1942–1948
61. Popelier PLA (2012) Quantum chemical topology: knowledgeable atoms in peptides. In: AIP conference proceedings, vol 1456, pp 261–268
62. Amadei A, Linssen ABM, Berendsen HJC (1993) Essential dynamics of proteins. Proteins Struct Funct Genet 17:412–425
63. Balsera MA, Wriggers W, Oono Y, Schulten K (1996) Principal component analysis and long time protein dynamics. J Phys Chem 100:2567–2572
64. Brooks B, Karplus M (1983) Harmonic dynamics of proteins: normal modes and fluctuations in bovine pancreatic trypsin inhibitor. Proc Natl Acad Sci USA 80:6571–6575
65. Wilson EB, Decius JC, Cross PC (1955) Molecular vibrations. McGraw-Hill, New York
66. Ochterski JW (1999) Vibrational analysis in Gaussian. http://www.gaussian.com/g_whitepap/vib.htm
67. Hughes TJ, Cardamone S, Popelier PLA (2015) Realistic sampling of amino acid geometries for a multipolar polarizable force field. J Comput Chem 36:1844–1857
68. Yuan Y, Mills MJL, Popelier PLA, Jensen F (2014) Comprehensive analysis of energy minima of the 20 natural amino acids. J Phys Chem A 118:7876–7891
69. Jensen F (2002) Polarization consistent basis sets. III. The importance of diffuse functions. J Chem Phys 117:9234–9240
70. Martin Pendas A, Francisco E, Blanco MA, Gatti C (2007) Bond paths as privileged exchange channels. Chem Eur J 13:9362–9371
71. Francisco E, Martin Pendas A, Blanco MA (2006) A molecular energy decomposition scheme for atoms in molecules. J Chem Theor Comput 2:90–102
72. Martin Pendas A, Blanco MA, Francisco E (2007) Chemical fragments in real space: definitions, properties, and energetic decompositions. J Comput Chem 28:161–184
73. Frisch MJ, Trucks GW, Schlegel HB, Scuseria GE, Robb MA, Cheeseman JR, Scalmani G, Barone V, Mennucci B, Petersson GA, Nakatsuji H, Caricato M, Li X, Hratchian HP, Izmaylov AF, Bloino J, Zheng G, Sonnenberg JL, Hada M, Ehara M, Toyota K, Fukuda R, Hasegawa J, Ishida M, Nakajima T, Honda Y, Kitao O, Nakai H, Vreven T, Montgomery JJA, Peralta JE, Ogliaro F, Bearpark M, Heyd JJ, Brothers E, Kudin KN, Staroverov VN, Kobayashi R, Normand J, Raghavachari K, Rendell A, Burant JC, Iyengar SS, Tomasi J, Cossi M, Rega N, Millam NJ, Klene M, Knox JE, Cross JB, Bakken V, Adamo C, Jaramillo J, Gomperts R, Stratmann RE, Yazyev O, Austin AJ, Cammi R, Pomelli C, Ochterski JW, Martin RL, Morokuma K, Zakrzewski VG, Voth GA, Salvador P, Dannenberg JJ, Dapprich S, Daniels AD, Farkas Ö, Foresman JB, Ortiz JV, Cioslowski J, Fox DJ (2009) Gaussian 09. Gaussian Inc., Wallingford
74. Weaver S, Gleeson MP (2008) The importance of the domain of applicability in QSAR modeling. J Mol Graph Model 26:1315–1326
75. Patrikeev L, Joubert L, Tognetti V (2016) Atomic decomposition of Kohn–Sham molecular energies: the kinetic energy component. Mol Phys 114:1285–1296
76. Tognetti V, Joubert L (2014) Density functional theory and Bader’s atoms-in-molecules theory: towards a vivid dialogue. Phys Chem Chem Phys 16:14539–14550
77. Hughes TJ, Kandathil SM, Popelier PLA (2015) Accurate prediction of polarised high order electrostatic interactions for hydrogen bonded complexes using the machine learning method kriging. Spectrochim Acta A 136:32–41

PCCP

PAPER

The computational prediction of Raman and ROA spectra of charged histidine tautomers in aqueous solution

Cite this: Phys. Chem. Chem. Phys., 2016, 18, 27377

Salvatore Cardamone (a,b), Beth A. Caine (a,b), Ewan Blanch (c), Maria G. Lizio (a,b) and Paul L. A. Popelier* (a,b)

Histidine is a key component of a number of enzymatic mechanisms, and undertakes a myriad of functionalities in biochemical systems. Its computational modelling can be problematic, as its capacity to take on a number of distinct formal charge states, and tautomers thereof, is difficult to capture by conventional techniques. We demonstrate a means for recovering the experimental Raman optical activity (ROA) spectra of histidine to a high degree of accuracy. The resultant concordance between experiment and theory is of particular importance in characterising physically insightful quantities, such as band assignments. We introduce a novel conformer selection scheme that unambiguously parses snapshots from a molecular dynamics trajectory into a smaller conformational ensemble, suitable for reproducing experimental spectra. We show that the ‘‘dissimilarity’’ of the conformers within the resultant ensemble is maximised and representative of the physically relevant regions of molecular conformational space. In addition, we present a conformer optimisation strategy that significantly reduces the computational costs associated with alternative optimisation strategies. This conformer optimisation strategy yields spectra of equivalent quality to those of the aforementioned alternative optimisation strategies. Finally, we demonstrate that microsolvated models of small molecules yield spectra that are comparable in quality to those obtained from ab initio calculations involving a large number of solvent molecules.

Received 19th August 2016, Accepted 12th September 2016
DOI: 10.1039/c6cp05744f
www.rsc.org/pccp
This article is licensed under a Creative Commons Attribution 3.0 Unported Licence.

a Manchester Institute of Biotechnology (MIB), 131 Princess Street, Manchester M1 7DN, UK. E-mail: [email protected]; Tel: +44 (0)161 3064511
b School of Chemistry, University of Manchester, Oxford Road, Manchester M13 9PL, UK
c School of Science, Royal Melbourne Institute of Technology (RMIT), 124a La Trobe Street, Melbourne, VIC 3001, Australia

Open Access Article. Published on 12 September 2016. This journal is © the Owner Societies 2016.

1. Introduction

Raman optical activity (ROA) is a powerful chiroptical technique that can be used to probe a number of molecular properties. In its early days, ROA was used to deduce stereochemical information of molecular species by comparison with band patterns of related molecules [1]. Unlike a number of other spectroscopies, ROA intensities in the low wavenumber region of the spectrum are rich in information, since these regions correspond to, for example, torsional degrees of freedom [2] and large scale conformational modes of motion [3,4]. ROA also possesses the capacity to determine the proportions of enantiomers within a sample, which is of great importance in purification processes [5].

The structural features of biochemical systems make them particularly amenable to ROA spectroscopy. ROA offers a far more robust analysis of biomolecular structure than more common forms of vibrational spectroscopy because of its sensitivity to chirality. The ROA signal of a system is largely dominated by the more rigid and chiral elements of that system, and in the case of proteins and peptides this corresponds to the secondary structure motifs exhibited by the polypeptides. This makes ROA highly useful for structural characterisation of complex biomolecules, in contrast to Raman spectroscopy, where amino acid sidechain signals typically dominate the spectrum [6]. ROA is by no means constrained to peptide systems, and the ROA signals of other biochemical species, such as DNA bases [7] and oligomeric nucleic acids [8], also prove informative. The utility of ROA in structural analysis has been exemplified over the years by a number of studies on the folding [6,9], assembly [10] and dynamical properties [11] of biochemical systems.

Histidine is unusual in its ability to perform a number of biochemical roles. Traditionally characterised as a polar amino acid, histidine does not substitute particularly well with other amino acids [12]. The nitrogen atoms of the imidazole group have long been known to act as proton shuttles, and grant functionality for a number of enzymatic mechanisms. However, histidine has also been shown to participate in

interactions that are typically perceived to involve large aromatic amino acids such as tryptophan and phenylalanine. Noncovalent interactions involving histidine have been shown to exist in the form of stacking between its imidazole ring and a number of DNA nucleobases [13].

Histidine features in the ‘‘catalytic triad’’ of a number of biochemically important enzymes, such as serine and cysteine proteases, where it acts as a proton shuttle during the process of protein catabolism [14]. The Manzetti mechanism [15] suggests that matrix metalloproteinases, a set of well-characterised cancer therapeutic targets, attain catalysis through a pair of histidine residues that coordinate a Zn²⁺ ion. The penta-coordinate Zn²⁺ ion is induced to act as a reversible electron donor, and can hydrolyse the scissile peptide bond. Indeed, research within the field of biocatalysis has found evidence for histidine residues within tripeptide motifs co-ordinating even more exotic metal ions, such as Pt²⁺ and Au³⁺, and exhibiting chemotherapeutic properties [16].

Five protonation states of histidine exist: His²⁺, His¹⁺, His⁰, His¹⁻ and His²⁻, where the superscript denotes the formal charge state. In addition to these protonation states, the imidazole ring in His⁰ and His¹⁻ can exist in one of two tautomeric forms. This effect originates from the lone pairs of the nitrogen atoms contributing to π-delocalisation around the five-membered imidazole ring. The imidazole nitrogen atoms are labelled Nπ or Nτ, the former being the nitrogen one bond away from the β carbon in the alkyl sidechain, and the latter being the nitrogen two bonds away from the β carbon [17], as shown in Fig. 1. For convenience, we also depict the major torsional degrees of freedom that are typically varied for conformational studies of histidine in Fig. 1; (φ, ψ) are the conventional Ramachandran dihedrals, while χ₁ and χ₂ are the side chain dihedrals. Tautomeric forms are denoted by specifying the protonated nitrogen, i.e. NπH or NτH. Using this nomenclature, we can uniquely label the seven microstates of histidine: His²⁺, His¹⁺, His⁰[NπH], His⁰[NτH], His¹⁻[NπH], His¹⁻[NτH] and His²⁻. We use the nomenclature ‘‘microstate’’ throughout this work in reference to one of these seven forms of histidine.

Fig. 1 Zwitterionic histidine (His⁰[NτH]) where the imidazole nitrogen atoms have been labelled as indicated in the text. The two sidechain dihedral angles, χ₁ and χ₂, are also labelled.

In this work, we present a number of developments that we have instituted and tested for the accurate computation of both Raman and ROA spectra. We have investigated only the His⁰ (i.e. both His⁰[NτH] and His⁰[NπH]) and His¹⁺ charge states here, since these are the primary forms that histidine takes in the majority of biochemical systems at physiological pH. The issues of conformer selection, conformer solvation and conformer optimisation are each tackled independently, and their effects are demonstrated on both tautomeric systems and formally charged systems.

2. The problem of conformer generation

2.1 Solvation

The fact that biochemical systems are stabilised by solvent interactions makes their modelling problematic. Since the solvent plays an important role in dictating the energetics of a molecular system, it becomes imperative that spectroscopic predictions include some level of solvent modelling. It has been shown that simply omitting the solvent from spectroscopic calculations yields poor quality results that severely digress from experiment [18]. Indeed, when dealing with charged species, such as zwitterionic amino acids, the molecules are energetically stable only if the solvent is included [19], and optimise to the neutral amine-carboxylic acid species in the absence of constraints.

The two major models for solvation effects in ab initio calculations are implicit and explicit solvation. With the former [20], the solvent environment is modelled as a homogeneous dielectric medium, reducing the magnitude of electrostatic interactions relative to the system in vacuo. The Polarisable Continuum Model [21] (PCM) and Conductor-like Screening Model [22] (COSMO) have been effective in predicting molecular energies [23–25], but lack any atomic-level properties of the solvent, such as the solvent’s ability to form hydrogen bonds with a solute. Implicit solvation methods have been found to offer minimal benefits in computing vibrational frequencies, and therefore vibrational spectra, relative to gas phase calculations [26].

A number of explicit solvation schemes have been investigated. At the simplest level, one can include a small number of solvent molecules and treat the entire solute–solvent system quantum mechanically, perhaps embedded in an implicit solvation field. The number of explicit solvent molecules treated in this way is arbitrary, and their positioning around the solute can be problematic. Solvent molecules can be added in an ad hoc fashion around polar (or non-polar if the solvent is non-polar) groups, leading to significant improvement over gas phase spectra [27]. However, a number of problems can also be attributed to this methodology, such as poor optimisation of the system geometry owing to its distance from a minimum on the potential energy surface [28].

Another more popular method is the use of QM/MM, where the solvent is treated classically and the solute quantum mechanically. This has proven to be a very successful method for calculating spectra. For example, Cheeseman and coworkers [29] have predicted the methyl-β-D-glucose ROA spectrum to an impressive level of accuracy, reproducing the vast majority of

spectral features. More recently, this approach has been further validated on a number of other systems, such as glucuronic acid [30] and β-D-xylose [31]. However, one is still constrained in selecting a conservative number of solvent molecules to include in the calculation for it to remain tractable; simply including hundreds of solvent molecules in the QM/MM calculation is not routinely feasible.

2.2 Optimisation

Complications arise in the QM/MM method of modelling solvent during the geometrical optimisation of the system. For the normal modes of motion of the system to be correctly determined, the system is required to relax to an energetic minimum on the potential energy surface. Geometry optimisation is notoriously difficult for a system including even a small number of explicit solvent molecules [30,31]; the inclusion of water molecules in these studies flattens the potential energy surface, making it difficult to find well-defined minima. Two methods to resolve this problem have therefore been proposed: OptAll and OptSolute. The OptAll scheme optimises the entire QM/MM system (with a less rigorous convergence threshold than is usually implemented), while the OptSolute method freezes all solvent molecules and optimises only the solute in the field of the static solvent. The latter method is naturally much quicker than the former, but was found to yield spectra of poorer, although still informative, quality. Some trade-off between accuracy and computational tractability is to be expected, but without there being an accepted methodology for qualitatively comparing predicted spectra to experimental spectra, it is difficult to ascertain whether the improvement in spectral quality warrants the additional computational efforts.

3. Technical details

3.1 Experimental setup

We have prepared samples of histidine at a pH of 7.8 and 4.2, both at a concentration of 0.26 M (equivalent to 40 mg mL⁻¹). Spectra were collected over a period of 7 hours, using an incident wavelength of 532 nm, a laser power of 700 mW and a 1.47 s exposure time. All spectra have been normalised so that they can be directly compared with the computed spectra. The absolute amplitude is therefore irrelevant in our analysis.

3.2 Molecular dynamics

The arrangement of solvent molecules is typically accomplished by use of molecular dynamics. The system is solvated and the resultant trajectory is parsed into a number of snapshot configurations. These configurations are subsequently used as the conformational ensemble for spectral calculations. Recent work [32] has selected the snapshots entirely randomly, with very rough (qualitative) estimates used to maximise the conformational diversity of the ensemble. The snapshots are then energetically minimised, and those snapshots with a high Boltzmann weight form an ensemble that is considered representative of the dominant system conformations.

We have conducted 100 ns molecular dynamics trajectories for His⁰[NτH], His⁰[NπH] and His¹⁺, solvated in boxes comprising 469 water molecules, with a chloride ion added in the latter case to ensure the system remained electrically neutral. Throughout, we have used the GROMACS [33] molecular dynamics package in conjunction with the OPLS-AA [34] force field.

Energy minimisation of these systems was initially undertaken with the steepest descent method, until convergence of the total system energy was attained. Following energy minimisation, we have conducted two separate equilibration phases. The first was a 1 ns NVT equilibration, using the Berendsen thermostat to maintain a temperature of 300 K, whilst the second was a 1 ns NPT equilibration using the Parrinello–Rahman barostat to maintain both a temperature of 300 K and a pressure of 0.1 MPa (1 bar). The particle mesh Ewald methodology was used to treat long-range electrostatics. A cut-off distance of 10 Å was chosen for Coulombic and van der Waals interactions. Both equilibration stages were verified by checking the convergence of the temperature for the NVT equilibration and the density of the system for the NPT equilibration. Both showed convergence after roughly 100 ps, and so our equilibration was assumed to be sufficient.

Our final step involved a 100 ns production molecular dynamics run in the NPT ensemble at a temperature of 300 K and a pressure of 0.1 MPa. To this end, we have used the Berendsen thermostat coupled with the isotropic Parrinello–Rahman barostat. The dynamical timestep used for our simulations was 0.5 fs, with a snapshot geometry output every 5 ps, resulting in 20 000 snapshot geometries. Coulombic and van der Waals interactions were treated as we have described in the preceding paragraph.

3.3 Filtering of similar geometries

For the following discussion, the 3 × N matrix $X = [x_1 \ldots x_N]^{\top}$, N being the number of atoms in the molecule, is used to denote a molecular conformation, where $x_i$ is the Cartesian position vector of the ith atom.

Perhaps the most ubiquitous form of geometrical comparison between two structures, X and Y, is the root mean square deviation (RMSD), which we denote $R(X, Y)$,

$$R(X, Y) = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \lVert x_i - y_i \rVert^2} \qquad (1)$$

However, in the form of eqn (1), a number of pitfalls can be encountered when dealing with molecular conformations. The most significant issue is the fact that $R(X, Y)$ is not invariant with respect to rigid transformations. For example, if X and Y possess the same internal geometry, but are rigidly translated relative to one another, then eqn (1) represents the two geometries as being dissimilar. The same result holds for the case of rigid rotations relative to one another. Of course, molecules in homogeneous environments are entirely equivalent after having undergone rigid translations, and so the RMSD in its current incarnation is of no value for our purposes.
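As a small illustration of this pitfall, the sketch below evaluates eqn (1) directly, with no superposition, for a three-atom geometry and a rigidly translated copy of it. The function name `rmsd` and the toy coordinates are illustrative assumptions and do not come from the original work.

```python
import math


def rmsd(X, Y):
    """Naive RMSD of eqn (1): coordinates are compared as given, with no superposition."""
    if len(X) != len(Y):
        raise ValueError("both conformations must contain the same number of atoms")
    squared = sum(math.dist(x, y) ** 2 for x, y in zip(X, Y))
    return math.sqrt(squared / len(X))


# Hypothetical three-atom geometry (arbitrary units).
X = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]

# The same internal geometry, rigidly translated along the x axis.
shift = (2.0, 0.0, 0.0)
Y = [tuple(a + s for a, s in zip(atom, shift)) for atom in X]

print(rmsd(X, X))  # identical coordinates
print(rmsd(X, Y))  # identical internal geometry, different position in space
```

Although X and Y are the same conformation, the naive RMSD of the translated copy equals the length of the translation rather than zero, which is why a superposition step is needed before the measure is useful for conformer comparison.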

We can render eqn (1) in a useful form by finding $R(X, Y)$ such that it is minimised with respect to all rigid rotations and translations, i.e. by minimising the function

$$E(X, Y) = \frac{1}{N} \sum_{i=1}^{N} \lVert U x_i - y_i \rVert^2 \qquad (2)$$

where U is an orthonormal matrix, i.e. $U^{\top} U = I$, the identity matrix, and can therefore be interpreted as a matrix defining a rigid transformation [35]. Taking the square root of $E(X, Y)$ therefore corresponds to the minimised RMSD (mRMSD) between the two molecular conformations. A further step is required to ensure our definition of the mRMSD is robust: if the orthonormal transformation matrix U does not constitute a right-handed system ($|U| < 0$), then it represents an improper rotation, i.e. a transformation that can be represented by some combination of a rotation about an axis and subsequent reflection in a plane. Thus, we require computation of the determinant of U to ensure we deal only with proper rotations.

The minimisation of eqn (2) with respect to U can be undertaken in a number of ways. Historically, the orthonormality conditions of U have been added to eqn (2) as Lagrange multipliers and formal solutions obtained, as given by Kabsch [36,37], after whom the algorithmic formalism for solution of eqn (2) is named. We proceed in deriving the form of U in a more physically intuitive way, elaborated upon in a number of more recent sources [38,39]. For the sake of brevity, we simply present the computable expression for U,

$$W V^{\top} = U \qquad (3)$$

where $W^{\top}$ and $V$ contain the right and left singular vectors of the so-called covariance matrix, $X Y^{\top}$, and the components of X and Y have been centred on their respective molecular centroids.

Our final consideration centres on whether U defines a proper or improper rotation, as we have previously alluded to. We can account for this by specifying that if $|X Y^{\top}| < 0$, then $d = -1$, otherwise $d = 1$, allowing us to give the final expression for U,

$$U = W \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & d \end{pmatrix} V^{\top} \qquad (4)$$

We propose a workflow for the implementation of the Kabsch RMSD for computing a set of maximally dissimilar molecular conformations, the ‘‘ensemble’’, from a larger set. For our work, this larger set corresponds to the MD trajectory, and the ensemble is constructed by parsing snapshots of the MD trajectory into conformers of the ensemble. Throughout, we keep the notion of ‘‘trajectory’’ and ‘‘snapshot’’ distinct from ‘‘ensemble’’ and ‘‘conformer’’. To this end, we propose the use of a greedy MaxMin heuristic, where points are added to the ensemble by iteratively transferring from the trajectory those snapshots that are most dissimilar to all those currently in the ensemble, one at a time [40]. So, given some initially large trajectory of molecular snapshots, we compute the Kabsch RMSD between all snapshots and transfer maximally dissimilar snapshots into the ensemble, until an ensemble of the required size has been constructed. This ensemble then corresponds to the set of conformers that are maximally dissimilar from the trajectory.

3.4 Ab initio spectrum calculation

All ab initio calculations were undertaken using the GAUSSIAN09 [41] software package, with the two-layer ONIOM method [42]. Histidine was modelled in the high layer, and solvent was modelled in the low layer. The high layer was treated at the B3LYP/6-31G(d) level of theory, and the low layer with the AMBER99SB force field [43] including TIP3P parameters for the water molecules. Electronic embedding was used, so that the low layer electrostatics were incorporated into the quantum mechanical Hamiltonian. The ab initio optimisations were performed with the Berny algorithm in the Cartesian basis, and no micro-iterations were used for electronic embedding.

For calculation of the molecular optical activity tensors, we invoke the analytical two-step formalism, or the n + 1 algorithm [44], in which the harmonic frequency and ROA tensor calculations are separated into differing levels of theory. For calculation of the normal coordinates and harmonic frequencies, we have used the B3LYP/6-31G(d) level of theory, while for the frequency-dependent ROA tensors, we have used the B3LYP/rDPS level of theory [45]. rDPS is a rarefied basis set that is based on the 3-21++G basis set with semi-diffuse p-functions on all hydrogen atoms [46]. rDPS has been shown to provide a similar level of accuracy in the resulting spectra to much larger Dunning basis sets [45], such as aug-cc-pVDZ.

An excitation wavelength of 532 nm was used for calculation of the ROA tensors, in keeping with the conventional experimental setup. We then obtained scattered circular polarisation backscattered (SCP-180) ROA intensities. Individual conformer spectra were Boltzmann weighted, and those with a Boltzmann weight exceeding 0.5% were used to form the spectrum (using a Lorentzian bandwidth of 10 cm⁻¹), for comparison with experimental spectra.

4. Conformational ensemble construction

Past work [47] on computing the Raman and ROA spectra of zwitterionic histidine has implemented a simplistic conformational selection tool. Six major in vacuo conformational preferences for histidine were proposed to arise from the χ₁, χ₂ torsional degrees of freedom: trans-plus (χ₂ = 180°, χ₁ = 90°); trans-minus (χ₂ = 180°, χ₁ = −90°); gauche-plus–plus (χ₂ = 60°, χ₁ = 90°); gauche-plus–minus (χ₂ = 60°, χ₁ = −90°); gauche-minus–plus (χ₂ = −60°, χ₁ = 90°); and gauche-minus–minus (χ₂ = −60°, χ₁ = −90°). These conformers were solvated in an ad hoc manner (a number of water molecules were randomly added around the histidine in each computational model) and energetically minimised. However, the solvation of in vacuo energetic minima does not yield a physically realistic conformational ensemble. Intramolecular hydrogen bonds stabilise in

vacuo structures, which are not as preferable when the solute is solvated. Under explicit solvation, interactions with solvent are more favourable than the intramolecular hydrogen bonds [48]. In addition to this, zwitterionic amino acids are not stable in vacuo. This instability necessitates the introduction of non-physical constraints on energetic optimisation to maintain the zwitterionic form.

For each system, we have utilised the Kabsch selection methodology outlined in Section 3.3. For each MD trajectory comprising 20 000 snapshots, we have parsed the 20 snapshots whose solute geometries are the most mutually diverse by use of a greedy MaxMin heuristic. Solvent geometries were not included in the Kabsch filtering. We present a conformer analysis of the conformational ensemble obtained by this Kabsch ‘‘filtering’’ in Fig. 2.

We have highlighted in grey those regions of the (χ₁, χ₂) plot in Fig. 2 that correspond to the in vacuo minima employed by Deplazes and coworkers [47], ±30°. Concerning the His¹⁺ conformers, the (χ₂ = −60°, χ₁ = 60°) region is not sampled at all. Analysis shows this fact to result from an unfavourable ammonium–imidazole NπH contact. Concerning the His⁰ conformers, when χ₂ = 90°, a favourable ammonium–imidazole Nπ interaction stabilises the His⁰[NτH] microstate. The poor sampling of the χ₁ = −60° region for the His⁰[NτH] microstate arises from an inability to form a stabilising carboxylate–imidazole NπH interaction. In contrast, the protonated NπH of the His⁰[NπH] and His¹⁺ microstates allows for this stabilising interaction. Importantly, we see significant levels of sampling in the χ₁ = 180° regions for all ensembles. These conformations correspond to an extended imidazole group, where no intramolecular interactions with the zwitterionic groups are formed. In these regions, solvent molecules are able to favourably interact with both the zwitterionic groups and the imidazole, in keeping with the findings of ref. 48. We have therefore demonstrated the inadequacy of using in vacuo minima for the conformational sampling of solvated systems.

For the His⁰ spectra, we have combined both His⁰[NπH] and His⁰[NτH] microstate ensembles into a single ensemble. For our microsolvation studies, we have decided to use water clusters comprising 5, 10, 15 and 20 explicit water molecules. To this end, we have selected the closest [5,10,15,20] water molecules to the centre of mass of the zwitterionic histidine from each conformer.

5. Weighting and optimisation

The optimisation of solvated systems is notoriously difficult owing to the absence of well-defined minima on the PES. If the PES is significantly undulant, conventional minimisation algorithms struggle, as low order spatial derivatives do not suffice in describing the local topology of the potential energy surface [49]. One is frequently reduced to sequentially optimising the system at a number of increasingly complex levels of theory to obtain convergence of the maximum atomic forces and displacements. Whilst tedious, it is also no guarantee of convergence, as we now outline.

To circumvent having to compute (time-consuming) second order spatial derivatives of the energy at every step of the geometry optimisation, updating methods [50,51] are typically invoked by commercially available quantum chemistry packages.

Fig. 2 Sidechain dihedral values (χ₁ and χ₂) taken by conformers within the three ensembles obtained by a Kabsch filtering of the MD trajectories outlined in Section 3.2. His¹⁺ (blue circles), His⁰[NπH] (green diamonds) and His⁰[NτH] (red triangles). Grey hatched regions correspond to those conformers used in ref. 47.
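The Kabsch filtering used to build the ensembles shown in Fig. 2 can be sketched as follows. This is a minimal NumPy illustration of eqn (2)–(4) and of a greedy MaxMin selection, not the authors' implementation. The function names `kabsch_rmsd` and `maxmin_select`, the choice of the first snapshot as the seed, and the use of a singular value decomposition to obtain W and V are all assumptions made here.

```python
import numpy as np


def kabsch_rmsd(X, Y):
    """Minimised RMSD (mRMSD) between two (N, 3) coordinate arrays.

    Rows are atoms, i.e. the transpose of the 3 x N convention used in the text.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    # Remove rigid translations by centring both geometries on their centroids.
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # 3 x 3 covariance matrix (X Y^T in the notation of the text).
    C = Xc.T @ Yc
    # Singular value decomposition C = V S W^T.
    V, _, Wt = np.linalg.svd(C)
    # d = -1 removes an improper rotation, cf. eqn (4). The sign of det(V W^T)
    # matches that of det(X Y^T) whenever the latter is non-zero.
    d = -1.0 if np.linalg.det(V @ Wt) < 0.0 else 1.0
    U = Wt.T @ np.diag([1.0, 1.0, d]) @ V.T
    diff = Xc @ U.T - Yc  # rows are U x_i - y_i
    return float(np.sqrt((diff ** 2).sum() / len(X)))


def maxmin_select(snapshots, k, seed_index=0):
    """Greedy MaxMin: pick k mutually dissimilar snapshots, returned as indices.

    Starting from an assumed seed snapshot, repeatedly add the snapshot whose
    smallest mRMSD to the current ensemble is largest.
    """
    if not 0 < k <= len(snapshots):
        raise ValueError("k must lie between 1 and the number of snapshots")
    chosen = [seed_index]
    # Distance from every snapshot to its nearest ensemble member.
    dmin = np.array([kabsch_rmsd(s, snapshots[seed_index]) for s in snapshots])
    dmin[seed_index] = -np.inf  # never re-select an ensemble member
    while len(chosen) < k:
        nxt = int(np.argmax(dmin))
        chosen.append(nxt)
        dnew = np.array([kabsch_rmsd(s, snapshots[nxt]) for s in snapshots])
        dmin = np.minimum(dmin, dnew)
        dmin[nxt] = -np.inf
    return chosen


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Hypothetical "trajectory": 50 random 10-atom geometries.
    trajectory = [rng.normal(size=(10, 3)) for _ in range(50)]
    print(maxmin_select(trajectory, k=5))
```

Written this way, the selection needs one pass of mRMSD evaluations per added conformer rather than the full snapshot-by-snapshot distance matrix, which matters for a trajectory of 20 000 snapshots. Only solute coordinates would be passed in, in line with the statement that solvent geometries were excluded from the filtering.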

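To make the idea of Hessian updating concrete, the sketch below applies the BFGS formula, one widely used updating scheme, to a two-dimensional quadratic model. The source does not state which update formula the quantum chemistry packages employ, so the choice of BFGS, the function name `bfgs_update` and the toy numbers are assumptions for illustration only.

```python
def bfgs_update(B, s, y):
    """Return the BFGS-updated Hessian approximation.

    B : current approximate Hessian, as a list of rows
    s : geometry step, x_new - x_old
    y : gradient change, g_new - g_old
    Only first-derivative information (y) enters the update.
    """
    n = len(s)
    Bs = [sum(B[i][j] * s[j] for j in range(n)) for i in range(n)]
    sBs = sum(s[i] * Bs[i] for i in range(n))
    ys = sum(y[i] * s[i] for i in range(n))
    if ys <= 0.0 or sBs <= 0.0:
        # Curvature condition violated: keep the previous approximation.
        return [row[:] for row in B]
    return [
        [B[i][j] + y[i] * y[j] / ys - Bs[i] * Bs[j] / sBs for j in range(n)]
        for i in range(n)
    ]


# Toy quadratic surface with a known ("true") Hessian.
H_true = [[2.0, 0.5], [0.5, 1.0]]
B0 = [[1.0, 0.0], [0.0, 1.0]]  # crude initial guess
s = [1.0, -1.0]                # one geometry step
# On a quadratic surface the gradient change is exactly H_true applied to s.
y = [sum(H_true[i][j] * s[j] for j in range(2)) for i in range(2)]

B1 = bfgs_update(B0, s, y)
print(B1)
```

After one update the approximation reproduces the true curvature along the step just taken (B1 applied to s equals y) but not in other directions. On a strongly anharmonic surface the gradient change no longer reflects a single underlying Hessian, so the accumulated approximation can drift as the geometry moves away from the point at which the Hessian was last computed explicitly.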

Open Access Article. Published on 12 September 2016. This article is licensed under a Creative Commons Attribution 3.0 Unported Licence.

Whilst one can in principle explicitly compute the second order spatial derivatives of the energy at each optimisation step, for the majority of systems this is not a feasible approach. As such, the Hessian is explicitly computed at the start of the geometry optimisation, and subsequently updated as a function of the displacement of the system from this initial geometry by use of first order derivative information. By approximating the Hessian at each step of the geometry optimisation, one can save a great deal of time. Indeed, such Hessian updating schemes have proven to be very accurate for a number of systems.52,53

However, we consider now the case where a complex system begins in a conformation that is far from the energetic minimum. Given that the potential energy surface is highly undulant, the Hessian updating scheme leads to increasingly poor approximations to the Hessian as the optimisation proceeds further from the initial geometry. As such, the geometry optimisation fails to converge. One is then required to adopt some intermediate approach, where the Hessian is explicitly calculated, updated for a number of optimisation steps, and then explicitly recomputed. The frequency with which one performs the explicit recomputation of the Hessian is then a matter of striking a balance between efficiency and accuracy.

An alternative approach to optimising the entire system has been alluded to in Section 2.2. In the work of Zielinski et al.31 and Mutter et al.,30 two levels of optimisation were considered: OptAll and OptSolute. The former optimises the entire system, and the latter optimises only the solute in the field of the unoptimised solvent. Noting that, in these studies, the optimisations were performed on snapshots from MD trajectories, the system is initially far from the minimum energy conformation, rendering the optimisation all the more difficult. Hence, if one is able to “get away with” a simpler optimisation, the savings in computational time will be substantial.

The reasonable agreement of OptSolute with OptAll in the referenced works indicates that one does not necessarily require the entire system to be optimised to obtain informative spectra. Considering the solute–solvent system as a whole, the OptSolute methodology calculates the spectra of unoptimised systems; technically speaking, it is only a subsystem (in this case the solute) that is optimised. This then raises the question of the value of energetic minima for these calculations, and whether strict geometry optimisation is an unnecessary burden.

The OptSolute model is appealing owing to the reduced computational complexity of the resultant optimisation. However, its use raises three pertinent questions: (1) How should the individual conformer spectra be Boltzmann weighted? (2) If the OptSolute scheme is valid, can we simplify the calculations further by adopting an even less rigorous optimisation criterion? (3) Are alternative subsystem optimisation schemes viable? We deal with, and elaborate upon, each of these questions in turn in the following sections.

As our test case, we take microsolvated zwitterionic histidine with five explicit water molecules. The conformers have been taken from the snapshots obtained with the methodology outlined in Section 4.

5.1 Boltzmann weighting

If the solute is optimised in the field of the unoptimised solvent, we query whether it is appropriate to use the energy of the entire system for Boltzmann weighting the conformation. We make the (reasonable) assumption that the dominant spectral features arise from the solute, particularly in the higher wavenumber regions, where librational modes do not feature. Consider the effects of optimising the solute while freezing the solvent: the solute will adopt an energetically preferable conformation, while the solvent remains unoptimised. The Boltzmann weight of the solute–solvent system could be relatively low owing to a particularly unfavourable solvent conformation. However, if the solute is in a comparatively low energy conformation, the low Boltzmann weight is not a proper reflection of the contribution of the conformer’s spectrum to the overall spectrum, since the solute dictates the major features of the spectrum.

In light of this argument, we propose two means for Boltzmann weighting the spectrum of a given conformer. The first takes the energy of the entire solute–solvent system (“solute–solvent-weighting”), while the second takes the energy of just the optimised solute, without the solvent (“solute-weighting”). The Boltzmann weights of the conformers from the two schemes are compared with one another in Table 1.

Table 1 Conformer Boltzmann weights exceeding 1% for the 35 microsolvated histidine systems using the solute–solvent-weighting and solute-weighting schemes. The percentages sum to 100% for both weights

Conformer number    Solute–solvent weight (%)    Solute weight (%)
1                   31.6                         14.7
2                   22.3                         21.6
11                  6.7                          16.3
25                  20.1                         20.6
27                  18.0                         23.6
35                  1.3                          3.2

From Table 1, we see that for both Boltzmann weighting schemes, the same six conformers dominate the resultant spectrum, albeit in different proportions. The first conformer is dominant with solute–solvent-weighting, while the Boltzmann weight of the first conformer is roughly two times lower with solute-weighting. We can conclude that, in this case (and we have no reason to suspect that the result would not be general), both weighting schemes yield the same dominating conformers, but their contributions to the resulting spectrum differ rather significantly. To evaluate whether the resultant spectra are affected by the differing Boltzmann weights, we present the Raman and ROA spectra generated from the two schemes, in addition to the experimental spectra, in Fig. 3 and 4, respectively.

From the Raman spectra in Fig. 3, there appear to be no pronounced differences between the two weighting schemes. A number of subtle exceptions exist, but certainly none that render one spectrum superior to the other when compared with the experimental spectrum. From the ROA spectra in Fig. 4, we see that the two Boltzmann weighting schemes yield spectra that are again quite similar to the experimental spectrum. This similarity suggests that the Boltzmann weighting scheme does not massively impact on the spectra. However, there are a number of slight differences between the two. The 1350–1450 cm⁻¹ region of the experimental spectrum is dominant, with a well-defined +ve/−ve/+ve profile. We see that the first positive region is recovered by both of our calculated spectra, but is arguably better modelled with the solute-weighting. Indeed, it is interesting to note that this first peak is slightly blue-shifted relative to the experimental spectrum for the solute–solvent-weighting, whereas with the solute-weighting, this peak coincides quite well with the experimental spectrum. The subsequent −ve/+ve profile is present and well-recovered by both, but again the final positive peak is better recovered with the solute-weighting.

From the results we have given, we conclude that the difference between the two Boltzmann weighting schemes is small. In the absence of a definitive result, we have chosen to use the solute-weighting scheme owing to the small improvements we have alluded to in the 1350–1450 cm⁻¹ region. We also believe that considering only solute energies makes intuitive sense in the context of the optimisation schemes, as we have discussed.
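The two weighting schemes discussed above differ only in which energies are fed into the Boltzmann factor. The following is a minimal sketch of that calculation, not the code used in this work: the function names, the temperature of 298.15 K, and all energies and spectra in the usage section are hypothetical values chosen for illustration.

```python
import math

# Boltzmann constant in Hartree per kelvin (CODATA value, rounded).
K_B_HARTREE_PER_K = 3.166811563e-6


def boltzmann_weights(energies_hartree, temperature_k=298.15):
    """Return normalised Boltzmann weights for a list of conformer energies.

    The temperature default is an assumption for illustration only.
    """
    # Shift by the lowest energy so the exponentials stay well-conditioned.
    e_min = min(energies_hartree)
    beta = 1.0 / (K_B_HARTREE_PER_K * temperature_k)
    factors = [math.exp(-beta * (e - e_min)) for e in energies_hartree]
    total = sum(factors)
    return [f / total for f in factors]


def weighted_spectrum(spectra, weights):
    """Sum conformer spectra (equal-length intensity lists) using the weights."""
    n_points = len(spectra[0])
    return [sum(w * s[i] for w, s in zip(weights, spectra)) for i in range(n_points)]


if __name__ == "__main__":
    # Hypothetical energies (Hartree) for three conformers; not data from this work.
    system_energies = [-1000.0000, -1000.0005, -999.9990]  # solute + solvent
    solute_energies = [-548.0010, -548.0002, -548.0008]    # solute only
    spectra = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5], [0.5, 0.5, 0.0]]
    for label, energies in (("solute-solvent", system_energies),
                            ("solute", solute_energies)):
        weights = boltzmann_weights(energies)
        print(label, [round(w, 3) for w in weights],
              weighted_spectrum(spectra, weights))
```

With the same set of conformer spectra, swapping the energy list is the only change needed to move from solute–solvent-weighting to solute-weighting.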

5.2 Level of optimisation

We now assess whether we can circumvent the need to perform stringent geometry optimisations on our system. We have used three levels of optimisation for the comparison: (i) no optimisation (conformers taken straight from the MD snapshots), (ii) the “Loose” (maximum force ≤ 2.5 × 10⁻³ a.u.; maximum displacement ≤ 1.0 × 10⁻² Å), and (iii) the “Regular” (maximum force ≤ 4.5 × 10⁻⁴ a.u.; maximum displacement ≤ 1.8 × 10⁻³ Å) optimisation schemes available in the GAUSSIAN09 package. The resultant spectra are presented in Fig. 5 and 6. A fourth level of optimisation, “Tight” (maximum force ≤ 1.5 × 10⁻⁵ a.u.; maximum displacement ≤ 6.0 × 10⁻⁵ Å), was also attempted, but the system proved difficult to converge, and so we have not pursued this level.

Fig. 3 Raman spectra for the solute–solvent-weighting and solute-weighting schemes. The experimental Raman spectrum is also shown.

Fig. 4 ROA spectra for the solute–solvent-weighting and solute-weighting schemes. The experimental ROA spectrum is also given.

The Raman spectra in Fig. 5 obtained from the completely unoptimised conformers differ significantly from the other spectra formed from optimised conformers. This poor modelling is indicative of the fact that the MD snapshots are very far from energetic minima, and so are not representative of the conformational ensemble of the system. The ultimate arbiter is the comparison of the calculated spectrum with the experimental spectrum, and we see that agreement is poor between the unoptimised and experimental spectra. There appear to be no features of the experimental spectrum that are reproduced by the unoptimised spectrum.

Fig. 5 Raman spectra for the unoptimised, loosely optimised and regularly optimised conformers. The experimental Raman spectrum is also given.

Turning our attention to the optimised spectra, we see that the Raman spectra are very similar to one another, and reproduce a number of the features present in the experimental spectrum. The series of strong peaks in the 1200–1600 cm⁻¹ region are largely accounted for in our computed spectra. However, we are unable to characterise one level of optimisation as superior to the other from the Raman spectra.

The ROA spectra given in Fig. 6 do, however, differ to some degree. Surprisingly, the spectral features in the 800–1000 cm⁻¹ window appear to be poorly characterised by the regular optimisation spectrum, but are relatively strong in the loose optimisation spectrum. It could be that the amplitude of the peak at roughly 1100 cm⁻¹ dwarfs these features in the regular optimisation. We note that this peak is not particularly strong in the experimental spectrum. Assessing the 1350–1450 cm⁻¹ region, both the loose and regular optimisation schemes reproduce the +ve/−ve/+ve signal, with no notable difference in quality relative to the experimental spectrum. However, the signal appears to be red-shifted by roughly 50 cm⁻¹ with the loose optimisation relative to both the regular optimisation and experimental spectra.

Based on these results, we have deemed the loose optimisation to be sufficient in reproducing the spectral features of the experimental spectrum. The regular optimisation offers little improvement in the calculated spectra relative to the loose optimisation. We do not consider the red-shifting of the loose optimisation relative to the experimental spectrum to be detrimental, particularly since the spectral features are still well-recovered. Considering the convergence level is roughly an order of magnitude more stringent for the regular optimisation, it seems to be an unnecessary additional effort to optimise the systems so thoroughly. Indeed, the loose optimisation requires roughly half the amount of time to compute relative to the regular optimisation for this system. Ensuring the system is at least somewhat close to an energetic minimum appears to be sufficient for computing spectra. We reiterate that if the system is far from an energetic minimum, as we understand the unoptimised MD snapshots to be, one obtains extremely poor computed spectra.
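The optimisation levels compared in this section are distinguished by thresholds on the maximum force and maximum displacement. As a rough illustration, the sketch below encodes only the two thresholds quoted in this section and checks a pair of values against them. The dictionary and function names are hypothetical, and a real optimiser may apply further criteria that are not listed here.

```python
# Thresholds as quoted in this section:
# (maximum force in a.u., maximum displacement in the unit quoted in the text).
THRESHOLDS = {
    "Loose": (2.5e-3, 1.0e-2),
    "Regular": (4.5e-4, 1.8e-3),
    "Tight": (1.5e-5, 6.0e-5),
}


def is_converged(max_force, max_displacement, level):
    """Return True if both quantities fall at or below the thresholds for `level`."""
    force_tol, displacement_tol = THRESHOLDS[level]
    return max_force <= force_tol and max_displacement <= displacement_tol


def strictest_level_met(max_force, max_displacement):
    """Return the most stringent level satisfied, or None if even Loose fails."""
    for level in ("Tight", "Regular", "Loose"):
        if is_converged(max_force, max_displacement, level):
            return level
    return None


if __name__ == "__main__":
    # Hypothetical values from a partially optimised geometry.
    print(strictest_level_met(1.0e-3, 5.0e-3))
```

A geometry that meets the loose thresholds but not the regular ones is the situation found here to be sufficient for computing spectra.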

5.3 Alternative optimisation schemes

Fig. 6 ROA spectra for the unoptimised, loosely optimised and regularly optimised conformers. The experimental ROA spectrum is also given.

Given that the optimisation of a subsystem leads to informative spectra, the final point we wish to address is whether alternative subsystem optimisation schemes yield equally informative spectra. The OptAll scheme has been found to produce the highest quality spectra, and so this is the case we use as a benchmark. However, we can also formulate several alternative subsystem optimisation schemes:

(a) OptSolute
The solute is optimised in the field of the unoptimised solvent. We have already alluded to this approach and its adoption in previous work, yielding spectra that are comparable to OptAll.

(b) OptSolvent
The converse approach to OptSolute, where the solvent is optimised in the field of the unoptimised solute. This approach is potentially all the more appealing, since the optimisation takes place in the low layer of the QM/MM, and is therefore faster.

(c) OptSolvent → OptSolute
A “two-step” subsystem optimisation scheme: the OptSolvent methodology is invoked, followed by OptSolute. By optimising the layers separately, we hope that we can recreate the optimisation resulting from OptAll.

(d) OptSolute → OptSolvent
The OptSolute methodology is invoked, followed by OptSolvent. We do not imagine this to differ from the previous approach, but it is included for the sake of completeness.



(e) OptAll
The entire system is optimised simultaneously.

The first two methods, (a) and (b), are concerned with forsaking accuracy for a large decrease in computational cost. We expect these methods to yield spectra of poorer quality than OptAll, but to be far quicker to compute. In contrast, the last two methods, (c) and (d), look to replicate the quality of spectrum obtained by the OptAll scheme, while saving some level of computational time in the process. In the following, we refer to each optimisation scheme by the letter it has been listed with in the above enumeration. For example, when referring to OptAll, we will use (e).

From the Raman spectra presented in Fig. 7, we immediately see that (b) performs particularly poorly. All three amide I–III regions (1650 cm⁻¹, 1550 cm⁻¹ and 1300 cm⁻¹, respectively) are poorly represented, and the intensity of the low wavenumber regions is not in agreement with the experimental spectrum. However, the other four optimisation schemes yield spectra that are in good agreement with the experimental spectrum. The amide I–III regions are well-recovered, while the low wavenumber regions are modelled equally well. If we are to be fastidious, we may question the amide I regions of (a) and (d), where the signal does not possess the same relative intensity as that in the experimental spectrum. The best agreement with the experimental spectrum is arguably (c), which appears to recover the vast majority of spectral features present in (e).

Turning our attention to the ROA spectra of Fig. 8, we notice that the poor performance of (b) is continued, and virtually no experimental features are recovered. We also draw attention to the poor performance of (d) in the low wavenumber regions. In turn, this poor modelling of the low wavenumber regions obscures the finer details of the high wavenumber regions, and the spectrum as a whole deteriorates. Similarly, (a) appears to be a poor approximation to the experimental spectrum; the coarse details of the spectrum appear to be present, but the high wavenumber regions are poorly modelled. In contrast, (c) is in excellent agreement with both the spectra obtained through experiment and (e). With the exception of a few features in the low wavenumber regions, (c) and (e) are almost identical.

Offering some explanation for the results we have obtained, the optimisation of the solute appears to be the key factor in guaranteeing the quality of computed spectra; the poor results from (b) and (d) presumably derive from the unoptimised state of the solute when the spectrum is calculated. To recreate the finer details of the spectrum, it also appears as if the solvent requires some level of optimisation, particularly with the ROA measurements. This is why (c) significantly outperforms (a). The Raman spectra seem to be robust, or mainly invariant, with regard to the level of solvent optimisation.

In conclusion, we feel that (c) and (e) offer the best recovery of experimental spectral features. However, we have found that the computational cost associated with (c) is more than four times lower on average than that of (e). Relative to the computational cost associated with (a), (c) is roughly 50% more expensive. Therefore, it seems as if the OptSolvent → OptSolute, or (c), methodology strikes an ideal balance between accuracy and computational time. We proceed with this methodology through the remainder of this work.
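The mechanics of a two-step subsystem optimisation can be illustrated on a toy problem. The sketch below is not the QM/MM procedure used in this work: it assumes a hypothetical two-coordinate energy function, with one coordinate standing in for the solute and one for the solvent, and a plain gradient descent in which frozen coordinates are simply excluded from the update. It shows only how the OptSolvent → OptSolute sequence is organised and how its result can be compared with a simultaneous optimisation; it says nothing about spectral quality.

```python
SOLUTE, SOLVENT = 0, 1  # indices of the two toy coordinates


def energy(coords):
    """Hypothetical coupled quadratic energy; not a real potential energy surface."""
    solute, solvent = coords
    return (solute - 1.0) ** 2 + (solvent + 2.0) ** 2 + 0.5 * solute * solvent


def gradient(coords):
    """Analytic gradient of the toy energy above."""
    solute, solvent = coords
    return [2.0 * (solute - 1.0) + 0.5 * solvent,
            2.0 * (solvent + 2.0) + 0.5 * solute]


def optimise(coords, active, step=0.1, tol=1e-10, max_iter=10000):
    """Gradient descent that moves only the coordinates listed in `active`."""
    coords = list(coords)
    for _ in range(max_iter):
        grad = gradient(coords)
        if max(abs(grad[i]) for i in active) < tol:
            break
        for i in active:
            coords[i] -= step * grad[i]
    return coords


def opt_solvent_then_solute(start):
    """Two-step scheme: relax the solvent with the solute frozen, then the reverse."""
    after_solvent = optimise(start, active=[SOLVENT])
    return optimise(after_solvent, active=[SOLUTE])


def opt_all(start):
    """Relax both coordinates simultaneously."""
    return optimise(start, active=[SOLUTE, SOLVENT])


if __name__ == "__main__":
    start = [3.0, 1.0]  # hypothetical starting point far from the minimum
    for label, result in (("two-step", opt_solvent_then_solute(start)),
                          ("all", opt_all(start))):
        print(label, [round(c, 4) for c in result], round(energy(result), 4))
```

In this toy model each step of the two-step scheme is a lower-dimensional problem than the simultaneous optimisation, which is the source of the cost saving described above; how close the two end points are depends on how strongly the subsystems are coupled.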

6. Microsolvation

We define microsolvation to be the explicit solvation of a system using a small number of water molecules, typically not extending beyond the second solvation shell of the solute. The major question we wish to answer is whether computed spectra from microsolvated systems are able to model experimental spectra, without having to resort to performing calculations on large explicitly solvated systems. If we can answer this query positively, then it facilitates the computation of solvated Raman and ROA spectra.

Fig. 7 Raman spectra for (a) OptSolute, (b) OptSolvent, (c) OptSolvent → OptSolute, (d) OptSolute → OptSolvent and (e) OptAll schemes. The experimental Raman spectrum is also given.

Fig. 8 ROA spectra for (a) OptSolute, (b) OptSolvent, (c) OptSolvent → OptSolute, (d) OptSolute → OptSolvent and (e) OptAll schemes. The experimental ROA spectrum is also given.

6.1 Neutral histidine

For the Raman spectra in Fig. 9, we find that the amide I band at around 1600 cm⁻¹ is well recovered by each of the microsolvated systems. Since the amide I band is largely indicative of peptide C=O stretching modes, it is initially surprising that its modelling appears to be independent of the level of microsolvation, where solvent interactions presumably damp the stretching mode. However, upon closer inspection of the microsolvated conformers used for the spectra, we find that the peptide C=O is solvated in each conformer, rendering its modelling independent of the degree of microsolvation. Similarly, we recover the amide II triplet at ~1450–1550 cm⁻¹, albeit blue-shifted relative to the experimental spectrum. The triplet signature appears to be most pronounced with a solvation shell comprising 15 water molecules, where the central peak is dominant. This result is to be expected since a solvation shell of 15 water molecules appears to coincide with the number of water molecules comprising the first solvation shell for zwitterionic histidine. Since the amide II region is representative of peptide N–H bending and C–H stretching modes, we expect this region to be highly dependent on solvent interactions. It is, nevertheless, surprising that the quality of our computed spectra deteriorates with 20 water molecules.

Fig. 9 Raman spectra for the various levels of microsolvation of His⁰. The experimental Raman spectrum is also shown for comparison.

The amide III region is slightly less well-modelled by our microsolvated systems. In the experimental spectrum, we observe a well-defined quadruplet of peaks, spanning 1200–1350 cm⁻¹. This region is dominated by C–N stretching and N–H bending modes, and we would expect this region to be as solvation-dependent as the other two regions we have discussed. We note that we do not manage to definitively recover the clear quadruplet seen in the experimental spectrum, although the profile does seem to become more enhanced as we increase the degree of microsolvation. Indeed, when 20 water molecules are included, we are able to discern four clear peaks in the amide III region.

Regarding the impact of the tautomeric forms of histidine, Ashikawa and Itoh54 have found that the breathing motions for His⁰[NπH] and His⁰[NτH] can be related to Raman peaks at 1304/1260 cm⁻¹, respectively. In the same work, two additional marker regions characterising imidazole stretching modes between the tautomeric forms of histidine were proposed: 1568/1585 cm⁻¹ and 1090/1105 cm⁻¹. Later work by Toyama et al.55 has suggested an additional marker region at 1320/1354 cm⁻¹. Unfortunately, the majority of these marker regions inhabit the strong amide regions of the Raman spectrum, making them difficult to discern. However, the 1090/1105 cm⁻¹ region occupies a low intensity region of the Raman spectrum, and appears to be characterised in both our experimental and computed spectra, at each microsolvation level excluding the spectrum with 5 waters.

The experimental ROA spectrum is not quite as well recovered by the microsolvated spectra as the experimental Raman spectrum, as depicted in Fig. 10.

Fig. 10 ROA spectra for the various levels of microsolvation of His⁰. The experimental ROA spectrum is also shown for comparison.

This is not to say that the ROA spectra are of poor quality, but they simply suffer by comparison with the high quality of the Raman spectra. The dominant feature across both experimental and calculated spectra is the +ve/−ve/+ve signal at 1300–1500 cm⁻¹, which is well-recovered at all levels of microsolvation. However, it is worth pointing out that the calculated spectra appear to be blue-shifted by ~50 cm⁻¹ relative to the experimental spectrum. This blue-shifting is somewhat surprising. Note that we have not undertaken abscissa-scaling by 0.96, as prescribed by Scott and Radom.56 Doing so would blue-shift the spectrum further, and so would result in the deterioration of our calculated spectra.

We note that as the level of microsolvation increases, the peak at 1050 cm⁻¹ is amplified. A number of peaks exist in this region of the experimental spectrum, but none are as prominent as that in the microsolvated spectrum. Assessing the normal modes of motion around this region, we find that the majority are dominated by sidechain dihedral torsional degrees of freedom. This observation then suggests that the sidechain torsional degrees of freedom are more flexible in the microsolvated systems than in the experiment. Indeed, it is predominantly the zwitterionic groups and imidazole ring that form interactions with the solvent molecules in the microsolvated systems, and so it is to be expected that the alkyl sidechain be poorly solvated, and so not well recovered by the microsolvated spectra.

6.2 Cationic histidine

The literature regarding the calculation of Raman and ROA spectra of formally charged species is sparse. Indeed, the ab initio modelling of formally charged species is in itself difficult, since the basis sets require both diffuse functions and some description of polarisation, the former being of particular importance for anions. To assess the quality of Raman and ROA spectra of a formally charged species, we have selected cationic histidine. The fact that no tautomers exist for this species makes its modelling simpler. A formally charged species is presumed to undergo stronger interactions with solvent than the neutral species, and so we hypothesise that the degree of microsolvation will significantly influence the quality of the computed spectrum. Since the formal charge results from the protonation of the imidazole ring, we anticipate that the higher wavenumber regions will be poorly reproduced at the low levels of microsolvation since the explicit water molecules tend to primarily interact with the zwitterionic groups.

The computed Raman spectra of Fig. 11 are in relatively good agreement with the experimental Raman spectrum. The strong peaks at ~1200 cm⁻¹ and ~1300 cm⁻¹ become increasingly prominent as the degree of microsolvation increases. The peak at ~1500 cm⁻¹ is well-recovered by each of the microsolvated spectra. The resolution of the doublet of peaks at roughly 800 cm⁻¹ is similarly improved as the degree of microsolvation increases. Interestingly, the general agreement with the experimental spectrum deteriorates when 10 explicit water molecules are used for microsolvation. However, those spectra including 5, 15 and 20 explicit water molecules are very similar.

We offer a potential reason for the deterioration in spectrum quality when 10 explicit water molecules are included. When 5 explicit water molecules are included, the conformers are largely solvated around the backbone amide and carboxyl groups of the zwitterionic histidine. When 15 and 20 explicit water molecules are included, the conformers are completely solvated, i.e. both the backbone and sidechain groups of the zwitterionic histidine. However, when 10 explicit water molecules are included, only a couple of water molecules solvate the imidazole ring of the zwitterionic histidine. As such, we suggest that the partial solvation of the imidazole results in a biasing of certain vibrational frequencies originating from the imidazole–solvent interactions. The spectrum as a whole deteriorates since these imidazole–solvent frequencies then dominate the resultant spectrum. In summary, it is tempting to think of partial solvation as detrimental to the correct prediction of the vibrational spectra.

The computed ROA spectra in Fig. 12 are, however, not quite as successful in reproducing the major features of the experimental spectrum. The strong negative band at ~1400 cm⁻¹ is recovered in each of the computed spectra, and the positive doublet in the 1300–1350 cm⁻¹ region appears to become clearer as the level of microsolvation increases.


Fig. 11 Raman spectra for the various levels of microsolvation of His¹⁺. The experimental Raman spectrum is also shown for comparison.

Fig. 12 ROA spectra for the various levels of microsolvation of His¹⁺. The experimental ROA spectrum is also shown for comparison.

From previous, unpublished work on anionic adenosine triphosphate, we have found that the calculation of ROA spectra for formally charged species is rather difficult. One speculative reason for the poorer quality of the ROA spectra relative to the Raman spectra is the way one computes the intensity of spectral bands for the two spectroscopies. For Raman spectra, the intensity is given by the sum of the intensities of the right- and left-circularly polarised scattered light. For ROA spectra, the intensity is given by the difference in the intensities of the right- and left-circularly polarised scattered light. Therefore, ROA band intensities are more prone to error than the Raman band intensities. In other words, the ROA band intensities are smaller in magnitude than the Raman band intensities, and so the errors are amplified. It is this sensitivity to error that makes ROA a “gold-standard” spectroscopic technique, whereas Raman is somewhat more forgiving.

7. Conclusion

We have investigated and presented the results for a number of novel methodologies that can be used in the calculation of ab initio Raman and ROA spectra of solvated systems. We are suggestively vague in our use of the term “solvated systems”, since we see no reason why our methodology cannot be applied to systems outside of the domain of peptides, such as DNA bases.57,58 A novel conformational sampling methodology allows for the unambiguous extraction of a set of mutually diverse conformers from an MD trajectory of arbitrary length. We have shown that conformers do not require strict optimisation to obtain high quality spectra, and that a two-step subsystem optimisation (i.e. optimising the solvent while keeping the solute frozen, and subsequently optimising the solute while keeping the solvent frozen) yields spectra of the same quality as entire system optimisations. The microsolvation of both formally charged and neutral zwitterionic histidine species can yield calculated spectra that converge towards the experimental spectra.

Acknowledgements

S. C. and M. G. L. thank the BBSRC for awarding DTP studentships and P. L. A. P. the EPSRC for the award of an Established Career Fellowship (EP/K005472). All authors are grateful to Professor P. Bouř for access to his ROA spectrometer for performing the experimental measurements.

References

1 L. D. Barron, J. Chem. Soc., Perkin Trans. 2, 1977, 1074.
2 L. Barron and A. Buckingham, J. Am. Chem. Soc., 1979, 101, 1979.
3 L. D. Barron and J. Vrbancich, J. Chem. Soc., Chem. Commun., 1981, 771.
4 O. V. Shishkin, A. Pelmenschikov, D. M. Hovorun and J. Leszczynski, Chem. Phys., 2000, 260, 317.
5 L. Hecht, A. L. Phillips and L. D. Barron, J. Raman Spectrosc., 1995, 26, 727.
6 L. Barron, L. Hecht, E. Blanch and A. Bell, Prog. Biophys. Biophys. Chem., 2000, 73, 1.
7 E. W. Blanch, L. Hecht and L. D. Barron, Methods, 2003, 29, 196.
8 A. Bell, L. Hecht and L. Barron, J. Am. Chem. Soc., 1998, 120, 5820.
9 Z. Shi, R. W. Woody and N. R. Kallenbach, Adv. Protein Chem., 2002, 62, 163.
10 E. W. Blanch, L. Hecht, C. D. Syme, V. Volpetti, G. P. Lomonossoff, K. Nielsen and L. D. Barron, J. Gen. Virol., 2002, 83, 2593.
11 L. Hecht, L. D. Barron, E. W. Blanch, A. F. Bell and L. A. Day, J. Raman Spectrosc., 1999, 30, 815.
12 M. J. Betts and R. B. Russell, Bioinf. Genet., 2003, 317, 289.
13 C. D. Churchill and S. D. Wetmore, J. Phys. Chem. B, 2009, 113, 16046.
14 P. Carter and J. A. Wells, Nature, 1988, 332, 564.
15 S. Manzetti, D. R. McCulloch, A. C. Herington and D. van der Spoel, J. Comput.-Aided Mol. Des., 2003, 17, 551.
16 S. Nobili, E. Mini, I. Landini, C. Gabbiani, A. Casini and L. Messori, Med. Res. Rev., 2010, 30, 550.
17 P. Hudáky and A. Perczel, J. Phys. Chem. A, 2004, 108, 6195.
18 K. J. Jalkanen, R. Nieminen, M. Knapp-Mohammady and S. Suhai, Int. J. Quantum Chem., 2003, 92, 239.
19 E. Tajkhorshid, K. Jalkanen and S. Suhai, J. Phys. Chem. B, 1998, 102, 5899.
20 C. J. Cramer and D. G. Truhlar, Chem. Rev., 1999, 99, 2161.
21 B. Mennucci, Wiley Interdiscip. Rev.: Comput. Mol. Sci., 2012, 2, 386.
22 A. Klamt and G. Schüürmann, J. Chem. Soc., Perkin Trans. 2, 1993, 799.
23 V. Barone and M. Cossi, J. Phys. Chem. A, 1998, 102, 1995.
24 O. h. O. Brovarets’ and D. M. Hovorun, J. Biomol. Struct. Dyn., 2015, 33, 28.
25 O. B. Ol’ha, R. O. Zhurakivsky and D. M. Hovorun, J. Mol. Model., 2013, 19, 4119.
26 K. Jalkanen, I. Degtyarenko, R. Nieminen, X. Cao, L. Nafie, F. Zhu and L. Barron, Theor. Chem. Acc., 2008, 119, 191.
27 S. Luber and M. Reiher, J. Phys. Chem. A, 2009, 113, 8268.
28 K. H. Hopmann, K. Ruud, M. Pecul, A. Kudelski, M. Dračínský and P. Bouř, J. Phys. Chem. B, 2011, 115, 4128.
29 J. R. Cheeseman, M. S. Shaik, P. L. Popelier and E. W. Blanch, J. Am. Chem. Soc., 2011, 133, 4991.
30 S. T. Mutter, F. Zielinski, J. R. Cheeseman, C. Johannessen, P. L. Popelier and E. W. Blanch, Phys. Chem. Chem. Phys., 2015, 17, 6016.
31 F. Zielinski, S. T. Mutter, C. Johannessen, E. W. Blanch and P. L. Popelier, Phys. Chem. Chem. Phys., 2015, 17, 21799.
32 S. T. Mutter, F. Zielinski, P. L. Popelier and E. W. Blanch, Analyst, 2015, 140, 2944.
33 B. Hess, C. Kutzner, D. Van Der Spoel and E. Lindahl, J. Chem. Theory Comput., 2008, 4, 435.
34 G. A. Kaminski, R. A. Friesner, J. Tirado-Rives and W. L. Jorgensen, J. Phys. Chem. B, 2001, 105, 6474.
35 G. B. Arfken, H. J. Weber and F. E. Harris, Mathematical methods for physicists: a comprehensive guide, Academic Press, 2011.
36 W. Kabsch, Acta Crystallogr., Sect. A: Cryst. Phys., Diffr., Theor. Gen. Crystallogr., 1976, 32, 922.
37 W. Kabsch, Acta Crystallogr., Sect. A: Cryst. Phys., Diffr., Theor. Gen. Crystallogr., 1978, 34, 827.
38 E. A. Coutsias, C. Seok and K. A. Dill, J. Comput. Chem., 2004, 25, 1849.
39 S. Wallin, J. Farwer and U. Bastolla, Proteins: Struct., Funct., Bioinf., 2003, 50, 144.
40 J. B. Ghosh, Oper. Res. Lett., 1996, 19, 175.
41 M. Frisch, G. Trucks, H. B. Schlegel, G. Scuseria, M. Robb, J. Cheeseman, G. Scalmani, V. Barone, B. Mennucci, G. Petersson, et al., Gaussian 09 (Revision E.01), Gaussian Inc., Wallingford, CT, 2009, p. 200.
42 T. Vreven, K. S. Byun, I. Komáromi, S. Dapprich, J. A. Montgomery Jr, K. Morokuma and M. J. Frisch, J. Chem. Theory Comput., 2006, 2, 815.
43 I. Cukrowski and C. F. Matta, Chem. Phys. Lett., 2010, 499, 66.
44 K. Ruud and A. J. Thorvaldsen, Chirality, 2009, 21, E54.
45 J. R. Cheeseman and M. J. Frisch, J. Chem. Theory Comput., 2011, 7, 3323.
46 G. Zuber and W. Hug, J. Phys. Chem. A, 2004, 108, 2108.
47 E. Deplazes, W. Van Bronswijk, F. Zhu, L. Barron, S. Ma, L. Nafie and K. Jalkanen, Theor. Chem. Acc., 2008, 119, 155.
48 S. Ling, W. Yu, Z. Huang, Z. Lin, M. Harańczyk and M. Gutowski, J. Phys. Chem. A, 2006, 110, 12282.
49 A. B. Birkholz and H. B. Schlegel, Theor. Chem. Acc., 2016, 135, 1.
50 V. Bakken, J. M. Millam and H. B. Schlegel, J. Chem. Phys., 1999, 111, 8773.
51 C. Peng, P. Y. Ayala, H. B. Schlegel and M. J. Frisch, J. Comput. Chem., 1996, 17, 49.
52 S. K. Lee, W. Li and H. B. Schlegel, Chem. Phys. Lett., 2012, 536, 14.
53 A. B. Birkholz and H. B. Schlegel, J. Comput. Chem., 2015, 36, 1157.
54 I. Ashikawa and K. Itoh, Biopolymers, 1979, 18, 1859.
55 A. Toyama, K. Ono, S. Hashimoto and H. Takeuchi, J. Phys. Chem. A, 2002, 106, 3403.
56 A. P. Scott and L. Radom, J. Phys. Chem., 1996, 100, 16502.
57 O. h. O. Brovarets, R. O. Zhurakivsky and D. M. Hovorun, J. Comput. Chem., 2014, 35, 451.
58 O. h. O. Brovarets and D. M. Hovorun, Phys. Chem. Chem. Phys., 2014, 16, 15886.
