Recommending Hartree-Fock Theory with London-

Dispersion and Basis-Set-Superposition Corrections for the Optimization or Quantum Refinement of

Protein Structures

Lars Goerigk,a* Charles A. Collyer,b Jeffrey R. Reimersc,d* a School of Chemistry, The University of Melbourne, Parkville, Victoria 3010, Australia b School of Molecular Bioscience, The University of Sydney, Sydney, New South Wales 2006,

Australia c School of Physics and Advanced Materials, The University of Technology, Sydney, NSW

2007, Australia d Centre for Quantum and Molecular Structure, College of Sciences, Shanghai University,

Shanghai 200444, China

1 ABSTRACT We demonstrate the importance of properly accounting for London-dispersion and basis-set superposition-error (BSSE) in quantum-chemical optimizations of protein structures, factors that are often still neglected in contemporary applications. We optimize a portion of an ensemble of conformationally flexible lysozyme structures obtained from highly accurate X-ray crystallography data that serves as a reliable benchmark. We not only analyze root-mean-square deviations from the experimental Cartesian coordinates, but, for the first time, also demonstrate how London-dispersion and BSSE influence crystallographic R factors. Our conclusions parallel recent recommendations for the optimization of small gas-phase peptide structures made by some of the present authors: Hartree-Fock theory extended with Grimme’s recent dispersion and

BSSE corrections (HF-D3-gCP) is superior to popular density-functional-theory (DFT) approaches. Not only are statistical errors on average lower with HF-D3-gCP, but also its convergence behavior is much better. In particular, we show that the BP86/6-31G* approach should not be relied upon as a black-box method, despite its widespread use, as its success is based on an unpredictable cancellation of errors. Using HF-D3-GCP is technically straightforward and we therefore encourage users of quantum-chemical methods to adopt this approach in future applications.

KEYWORDS protein structures, Hartree-Fock, London-dispersion, basis-set-superposition error, quantum refinement

2 1. Introduction

With the advent of better computer architectures and improved methodologies, quantum- mechanical (QM) treatments of biomolecular structures, such as proteins, have become increasingly feasible. Most calculations of this type are usually carried out in a hybrid fashion with the region of interest, for instance the active center of an , being treated quantum- mechanically, while the remainder is described with a molecular-mechanics force field

(QM/MM).1-33 Additionally, studies have also been conducted that discuss the semi-empirical34-37 and full QM treatment38-43 of entire proteins or of large protein substructures containing 1,000-

150,000 unique atoms. One way of achieving such full QM treatments is to break down the large structures into smaller fragments that are separately treated and then recombined. This is for example done in the fragment-molecular-orbital (FMO),44-49 divide-and-conquer (DC)38, 50-54 and in other linear-scaling approaches.55-62 Another way for full QM treatments of biomolecules has been made possible by recently emerged codes allowing the treatment of complete structures on graphics-processing-units (GPUs).63-65

While QM and QM/MM calculations of proteins can be conducted to obtain answers for a range of chemical problems (see e.g. Reference 66 for a review), we dedicate our present study to structure optimization. Protein structure optimizations are often considered as the initial step preceding computational treatments of other properties, such as the calculation of excitation spectra39-40 or, more commonly, activation energies for reactions.67 However, they can also be the main interest in some studies, particularly in the context of crystallographic macro-molecular refinement.1 In crystallographic refinement — a key step in the determination of protein X-ray structures — the atomic coordinates and thermal parameters of the system are modified under consideration of the observed reflection intensities. Empirical geometrical constraints or

3 restraints are almost always used to aid the initial stages of refinement, but when ultrahigh resolution data is available then a final refinement stage can be completed without them. Use of empirical parameters carries the danger that the resulting structures may be flawed by incorporation of chemically non-real features, such as unlikely bond-lengths and bond-angles or even steric clashes between atoms or worse still, incorrect atomic assignments. Unconstrained final-stage refinement is likely to propagate any incorrect assignments because the real structure or ensemble of conformers is outside of the radius of convergence and therefore this approach is not a universal panacea. Moreover, empirical constraints can make the detection of rare and new structural features difficult. As a solution, Ryde and co-workers suggested usage of quantum- chemical methods as a remedy to this problem (“quantum refinement”).2 In fact, over the past decade many successful quantum-refinement studies have been carried out with QM/MM approaches proving that this idea can be a valuable addition to standard refinement techniques.3-

10

Before considering methods for general protein-structure optimization or quantum refinement, we consider first our test system and its mandatory requirements. This system is based on an X-ray crystal structure of triclinic hen egg-white lysozyme (HEWL) for which data is observed to 0.65 Å resolution, making it one of the best resolved structures in the protein database (PDB).68 As this particular structure is observed at ultrahigh resolution and has been refined without any geometrical constrains, it serves as an ideal benchmark for the evaluation of

QM structure-optimization procedures. Most importantly, such high-resolution observations allow us to distinguish between single and multiple structural conformational elements. Such distinction is typically not possible in the refinement of low to medium resolution structures and the deduced atomic coordinates thus represent some sort of average over the internal chemical

4

Figure 1. Schematic depiction of the 2x2x2 supercell of lysozyme.69 Each distinct lysozyme molecule is labeled with letters A-H and contains different conformations.

complexity of the protein, with some coordinates not obeying the rules of chemistry. Any attempt at QM refinement of such coordinates would dramatically increase the R factor (a measure of the difference between calculated X-ray reflections and the observed ones),70-71 an important and discriminating metric when the aim is to actually refine a structure but a stalling point if, as in the present case, our aim is to optimize methods.

However, even unconstrained refinement is not able to detect most of the correlations that occur between different fluctuating structural elements within a protein and so there is no available method for the unambiguous mapping of the determined structural parameters of say

HEWL onto chemically reasonable global protein structures. Previously some of us have developed methods for construction of such a mapping and applied it to HEWL.69 This creates an ensemble, shown in Figure 1, of 8 chemically reasonable, fully interconnected, global protein structures that generates the same scattering R factor as the original PDB-deposited uncorrelated

5 structure. We use this structure to investigate QM methods for protein structure optimization and for quantum-refinement.

Intricate structure-function relationships make it a requirement that protein geometry optimizations be carried out with reliable and robust QM methods in order to obtain a meaningful basis for subsequent analysis. As already outlined, important progress has been made in the development of new hardware and of linear-scaling techniques that allow us to treat large systems quantum mechanically. In addition, important advances have also been made to the quality of available QM methods. However, we also notice that their rate of acceptance in the general community of “QM users” is low in the fields of protein optimization or quantum refinement.15-33 This finding is somewhat surprising, for it has been shown that newer QM methods may offer specific advantages over the more established ones. Most of the recent QM and QM/MM optimizations have been carried out with Kohn-Sham density functional theory

(DFT72-73) methods in combination with small atomic-orbital (AO) basis sets.15-29 Indeed, the vast majority of these applications used the BP8674-75 and B3LYP76-77 functionals with basis sets such as the 6-31G*78 Pople basis or the SV(P)79 and def2-SV(P)80 Ahlrichs basis sets. Some authors also used Hartree-Fock (HF) theory instead of DFT.30-33 However, those levels of theories can be problematic for two main reasons. Firstly, standard DFT approximations or HF do not account for important London-dispersion interactions. Secondly, small AO basis sets introduce errors from their incompleteness. We shortly review these two topics in the next paragraphs and relate them to the context of protein optimization.

The inability of standard DFT approximations to describe London-dispersion is now well-accepted81-84 and particularly the past ten years have seen immense effort to overcome this problem;85-99 we recommend References 100-102 for discussions of pros and cons of the

6 currently available methods. Of particular popularity are Grimme’s additive dispersion corrections DFT-D1,90 DFT-D2,91 and the newest DFT-D398-99 that all allow an efficient treatment of London-dispersion without diminishing the cost of the underlying DFT approximation. DFT-D3 has been tested for about 50 different density functionals and it is applicable to the first 94 elements of the periodic system. It has been shown to be accurate for non-covalently bound complexes,98-99, 103-105 conformers of biologically relevant compounds,99,

102, 106 protein- interaction energies,41 structural properties of small- and medium-sized systems,99, 107 and in the treatment of general thermochemical properties.108-109 Herein, we implement this approach only.

In 2012, we investigated the effects of London-dispersion corrections in a quantum- refinement framework for HEWL.110 We optimized a portion of this lysozyme ensemble (every even-numbered residue) with a divide-and-conquer algorithm originally introduced by Reimers and co-workers for a DFT optimization of the 150,000-atom-comprising photosystem-I (PSI) trimer in 2006.38 We tested nine different levels of theory and for each optimized structure considering the changes introduced to the R factor, reporting three main findings. Firstly, we noticed a basis-set dependence that parallels what is known from QM applications to small molecules: larger basis sets provided better structures (lower R factors) than smaller ones.

Secondly, and more importantly, we could show that the popular BP86/6-31G* approach gave the worst R factor of all tested methods and that it could be improved by applying the DFT-D3 dispersion correction. However, the third and maybe most surprising finding was that the best result was obtained for HF and Grimme’s D3 correction (HF-D3).

Even though the first additive dispersion corrections had originally been developed for

HF in the 1970s,111-112 using this type of approach for structure optimization – particularly when

7 combined with the more recent dispersion corrections – is not very common. Roskop et al. reported accurate structural properties with HF and Grimme’s older D2 dispersion correction in a model study of silica mesoporous molecular sieves.113 In 2011, HF-D3 has been shown to provide noncovalent interaction energies of merely general-gradient-approximation (GGA) DFT quality,108 but small-molecule structures with HF-D3 were more accurate than GGA-DFT.99 In

2014, Brandenburg and Grimme reported accurate structure prediction of organic molecular crystals with HF-D3.114 In the context of protein structures, HF-D3 has only been applied in our study from 2012110 and in a study of proteins on GPUs by Martinez and co-workers.63 Given the scarce number of HF-D3 applications, our findings for lysozyme make it worthwhile to investigate this approach further.

In addition to the neglect of dispersion corrections in many recent QM treatments of proteins, we also observe that due to large system sizes, small AO basis sets are usually applied.

Results with small finite basis sets can be influenced by two errors, the so-called basis-set- superposition error (BSSE) and the basis-set-incompleteness error (BSIE), which is a collective term for all other finite-basis-set related errors other than BSSE. BSSE can be best understood for the case of a noncovalently bound dimer. In this qualitative picture, the monomers are treated in the presence of the atomic basis functions of the respective opposite monomer. As a consequence their energies are evaluated in a larger space of AO functions than the separated monomers. In other words, the absolute energy of the dimer is artificially overstabilized compared to its monomers. This error is less pronounced for larger basis sets and vanishes in the complete-basis set (CBS) limit of an infinite basis set, however it can become sizeable for double-z basis sets such as 6-31G*. In the intermolecular case, the counterpoise correction of

Boys and Bernardi115 has become a popular remedy for BSSE, albeit its usage is still

8 controversial (see Reference 116 for a recent discussion). In the context of biomolecules one can also picture the occurrence of an intramolecular variant of the basis-set-superposition error

(IBSSE). One example would be an overestimation of inter-residue interactions in a folded polypeptide that leads to an overstabilization of its energy with respect to its unfolded conformer.

The effects of IBSSE on energetic properties of polypeptides has been discussed for example in

References 106 and 117, and its effects on structures of organic or small biomolecules in

References 118-121.

Various schemes to correct for IBSSE have been suggested,122-123 however it can be conceptually challenging to apply them in generalized routine applications because some of them can involve unchemical features, uncommon spin-states or covalent-bond cleavage. Furthermore, standard BSSE corrections can be costly making their application to large systems difficult. In

2012, Kruse and Grimme developed the geometrical counterpoise (gCP) correction that corrects for intra- and intermolecular BSSE on a conceptually equal footing.42 This correction is solely based on the atomic coordinates and does only take the atomic basis functions indirectly into account with a few empirical parameters. Therefore, it comes at nearly no additional cost and it can be applied to large systems. The gCP correction can be used in DFT and HF calculations and it has been parameterized for a few standard AO basis sets of mostly double-z quality. In their first study, Kruse and Grimme showed the technical usability of this approach in a minimal- basis-set optimization of the crambine protein. A short time later, Kruse et al. outlined how the popular B3LYP/6-31G* approach sometimes benefits from a fortuitous error cancellation between BSSE and a missing dispersion correction in the calculation of reaction energies and barrier heights of organic-chemical reactions.124 The authors also demonstrated that such error compensation cannot be predicted and that BSSE for B3LYP/6-31G* is significant.

9 Consequently, they outlined how the B3LYP-D3-gCP/6-31G* approach provided more robust and reliable results and they recommended that this approach should replace the standard methodology. The gCP correction has also been successfully tested for periodic systems and was shown to be useful in organic crystal-structure prediction.114, 125 Despite its potential it has to be noted that the gCP correction only corrects for BSSE and that other BSIE effects remain uncorrected. Sure and Grimme, who investigated the potential of HF-D3-gCP in minimal-basis- set applications, demonstrated this recently.43 Because of a large BSIE in such applications, the authors developed a third, HF-specific BSIE correction and dubbed the entire approach HF-3c.

In 2013, Goerigk and Reimers further investigated the potential of the D3 and gCP corrections for the structure optimization of biomolecules.126 In this context they benchmarked the corrections against 26 di- and tripeptide conformations, against a new set of cysteine-dimer conformers and against three-dimensional water-hexamer isomers. All structures can be seen as examples for typical motifs in protein crystal structures. Ab initio or experimental results served as reference in this study. It was found that both D3 and gCP were needed to provide balanced descriptions of the structural properties. Goerigk and Reimers argued that the success of BP86/6-

31G* for structure optimizations is due to an error compensation, similar to what Kruse et al. had shown for B3LYP/6-31G* and thermochemical properties. However, also cases were identified where such error compensation was not observed and therefore it was argued that both corrections should always be used in DFT and HF optimizations of similar systems to ensure higher robustness. Despite the analysis of intramolecular dispersion and BSSE effects, the study also analyzed the impact of the BSIE on structures and it was found to mostly affect lengths. However, BSIE had different effects depending on the underlying QM method.

10 Table 1 Recommended levels of theory for peptide and protein structure optimization according to Reference 126.

System size Recommended level of theory

Small, medium (<100-150 atoms) PW6B95-D3 with a triple-z AO basis Large (>100-150 atoms) PW6B95-D3-gCP/SVP or 6-31G* HF-D3-gCP/SVP (S)a Very large (>1000 atoms) HF3c (minimal basis) awith polarization functions only on S atoms.

Based on those findings, Goerigk and Reimers were able to identify different levels of theory that they recommended for structure optimizations of polypeptides and it was hoped that those recommendations could also be transferred to proteins (Table 1). In particular, Truhlar’s

PW6B95127 hybrid functional and HF were identified as QM methods of choice.

Thus, two separate studies came to the intriguing conclusion to use HF instead of DFT for structure optimizations of proteins and related compounds. While our previous study on

HEWL had only analyzed the effects of London-dispersion, Goerigk and Reimers’ study also scrutinized effects from BSSE and BSIE, albeit only for smaller structures. Herein, we carry out the logical next step and explore if Goerigk and Reimers’ recommendations can indeed be transferred to proteins. For this purpose we return to the HEWL ensemble. However, this time we optimize a different portion of the protein, which serves as an additional cross-validation of our first study. Furthermore, we present a modification of the original divide-and-conquer approach to allow a better representation of the molecular environment. In contrast to our previous lysozyme study, we focus mostly on root-mean-square deviations (RMSDs) from the

Cartesian coordinates of the experimental X-ray structure. However, we also briefly analyze R

11 factors of our resulting structures, which means that for the first time we report on BSSE effects in a framework closely related to quantum refinement. We concentrate on various levels of theory and in particular we want to report whether dispersion and BSSE corrections do indeed provide better protein structures than the commonly used DFT approaches.

2. Details on Lysozyme

2.1. The initial structure and previous works

Originally discovered by Fleming in 1922,128 lysozyme is the first enzyme to be structurally characterized by X-ray crystallography in 1962.129 It is one of the most studied proteins and therefore it has become an important crystallographic model system. It crystallizes in various polymorphic space groups and consequently many structures have been deposited in the PDB. In 2007, Dauter and co-workers determined the structure of HEWL at an ultrahigh resolution of 0.65 Å and deposited it under the acronym “2VB1” in the PDB.68 This structure contains one protein chain per unit cell that is made up of 129 amino-acid residues. For most of those residues, hydrogen atoms were resolved experimentally. Additional the heavy atoms of some water molecules and some counterions were also identified.

Given the high resolution at which it has been determined and due to its moderate size,

2VB1 makes an ideal benchmark for quantum-chemical procedures. About one third of the atoms in 2VB1 have been identified with multiple sites, reflecting observed but uncorrelated conformational differences between molecules in different unit cells. Falklöf et al. addressed the problem of conformational flexibility in 2VB1 in detail in 2012 and set up a correlated ensemble structure for future quantum-chemical optimizations.69 Due to the triclinic symmetry of the

12 crystal, the simplest choice for a realistic description of the conformational variations was to take an ensemble of eight chemically complete structures. The originally determined occupancies of all atoms were each set to a multiple of one eighth, with all atoms in identified correlated fluctuations (e.g., all atoms in a residue) assigned the same occupancies. This 2x2x2 supercell was then treated within periodic boundary conditions and an MM Monte-Carlo procedure determined the lowest energy ensemble that still represented the measured conformations and their relative weights. For example, if a certain residue has been determined to have 5/8 occupancy, that residue is represented 5 times in that very same conformation in the ensemble.

The eight protein chains per supercell were labeled with letters A to H (see Figure 1). The lowest-energy ensemble was dubbed “2VB1iJ” and we use this structure in the following.

Including all hydrogen atoms (both experimentally observed and subsequently added), the entire superstructure contains 19,335 atoms. While the supercell is repeated in all three dimensions for QM treatments with periodic boundary conditions, it can also be converted back into its original lattice form to be handled by crystallographic packages for further analysis. A resulting pdb data file for such a treatment would then assign an occupancy of 1/8 to every single atom, labeling the atoms from each of the 8 cells within the original supercell as conformers A to

H. Before we started to use 2VBi1J as initial structure for our QM study in 2012, we validated the structure by calculating first separate R factors for each of the eight parts and then for the entire supercell. We showed that while each of the separate structures gave R values increased by at least a factor of 2, only the combined ensemble provided an R factor very similar to the one for the underlying 2VB1 experimental structure.110 Thus, the 2x2x2 supercell is fully consistent with the X-ray scattering data.

13 Falklöf et al. made the important observation that, despite being obtained at very high resolution, only 25% of the residues of 2VB1 are surrounded by a 5 Å region in which the atomic structure is unambiguously defined.69 Therefore it was advised to only optimize those well-defined regions and to leave the structure unchanged for the other. In our previous study110 we followed this recommendation and we do the same herein.

2.2. The lysozyme portion optimized in this work

In 2012, we used the 2VBi1J structure to assess the influence of the DFT-D3 correction on R factors.110 For that purpose we optimized all even-numbered residues in the sequence of

129 residues that were surrounded by the aforementioned well-resolved 5 Å region. In total, our study included 126 copies (in the regions A – H in Figure 1) representing 39 unique residues in the underlying experimental 2VB1 structure.

While this choice ensured that residues of every spatial region in lysozyme were optimized, it left out the residues in between and did not consider any more coherent fragments.

One obvious alternative for the present study would be to optimize all atoms in the 2VB1iJ supercell, but this would include extensive parts of the structure that remain poorly defined.

Instead, we chose to only optimize a portion of 2VB1iJ chosen to reveal anticipated effects arising from dispersion, BSSE and BSIE based on the following requirements. Firstly, the fragment should include at least one disulfide bridge, as such entities were determined to be very sensitive to BSIE.126 Secondly, some of the residues should be conformationally flexible or they should lie in the vicinity of conformationally flexible regions. Thirdly, the optimized fragment should, on the one hand, interact with other atoms of the same HEWL unit, but on the other hand, parts of the fragment should lie at the edge of the unit so that they are likely to interact

14

Figure 2. Depiction of the fragment that is optimized in this work and its location within one molecule of lysozyme. The central cysteine residues 64 and 80 are highlighted in red.

with adjacent HEWL molecules in the same 2x2x2 supercell or in adjacent supercells when periodic boundary conditions are employed. Finally, opposite strands of the fragment should noncovalently interact with each other, which allows crucial testing of the D3 and gCP approaches.

Figure 2 shows a fragment that fulfills all those conditions. It is located around the disulfide bridge between the two cysteine residues CYS-64 and CYS-80. Additionally, we also include a series of residues connected to each end of the cysteines. CYS-64 is embedded in the strand from tyrosine TYR-53 to threonine THR-69. On the opposite side, CYS-80 lies within the sequence from isoleucine ILE-78 to alanine ALA-90. Because we consider a supercell with eight

HEWL units, we simultaneously consider the remaining seven copies in our study. However, owing to the restriction that only residues be considered that lie within a 5 Å region with unambiguously resolved atoms, not all residues in all copies are optimized. In total, we are

15 therefore restricted to optimize 72 residues (see Table S1 in the Supporting Information).

However, those residues interact with other non-optimized parts of the same fragment and the surrounding protein. Therefore, our fragment is still representative to test for effects due to

London-dispersion and BSSE. In other words, by only optimizing the coordinates of atoms in well-defined regions we also ensure that the experimental structure in those regions can serve as an accurate benchmark. How the residues are optimized is described in the following section.

3. Technical details

3.1. The divide-and-conquer optimizer

In this study we apply a divide-and-conquer approach to our problem. The underlying idea of such approaches is depicted in Figure 3. At the very beginning, the large structure is divided into smaller fragments that can be separately treated in an acceptable timeframe. The fragments should be formed such that they are chemically reasonable and still show the same characteristics

Figure 3. Schematic overview of a divide-and-conquer approach for geometry optimizations.

16 as the parent structure. Furthermore, it is important that information from the environment of each fragment is embedded in its treatment. The separate fragments are then treated simultaneously and finally they are recombined. This cycle continues until convergence of the parent structure is achieved. If the entire approach is carefully defined, then the iteratively optimized structure is in principle equivalent to its fully QM-optimized counterpart.

In 2006, Canfield et al. demonstrated the technical capabilities of such an approach, optimizing the PSI trimer with 150,000 atoms under crystallographic boundary conditions at the

PW91130/6-31(+)G* level of theory.38 In their approach each fragment consists of three layers.

The first layer (or core) contains the portion of the protein that is actually being optimized with the QM method; that is either an entire residue, a complex, or a co-factor in the crystal structure.

However, to take into account that this moiety is either covalently or hydrogen bonded to other residues or co-factors, those adjacent units are included in a second shell surrounding the first layer. Also included are residues that form π-π stacks with the core. This second layer is treated with the same QM approach as the first, however the coordinates of the atoms in that layer are kept fixed. In this context, we note that each residue in the second layer forms the core of a separate fragment that is individually treated and optimized. To further take into account other long-range effects, the second layer is surrounded by a third that either forms covalent bonds, hydrogen bonds or p-p stacks with the second. To speed up the calculation this third layer is treated with an MM force field. Therefore the entire fragment is in fact treated with a QM/MM method. In this case the ONIOM method was chosen.131 However, the authors still obtained a fully optimized QM structure and not an ONIOM one because the MM method just provides a background field and mimics the chemical environments that the cores experience.

17 For the formation of the fragments, the authors distinguished between even- and odd- numbered residues and co-factors. First, they optimized the even-numbered, i.e., every second residue. After recombination, they formed new fragments with the odd-numbered residues in the respective cores. After that step, the co-factors were optimized. For the remainder of this manuscript we define that these three steps form one “optimization macro-cycle”. Once the first macro-cycle was finished, the next began again with even-numbered residues being in the fragment cores.

In our first study of the HEWL supercell in 2012, we employed exactly the same fragmentation scheme as Canfield et al. and we tested various QM levels of theory, including different basis sets, density functionals or HF theory, and furthermore we also assessed usage of

DFT-D3. After our study had been published and in preparation for this present investigation we have analyzed the original approach further and identified two problematic features.

Firstly, the treatment of disulfide bridges turned out to be difficult and we found that the original approach sometimes broke S-S bonds and capped them with a hydrogen atom.

Moreover, we now know from Goerigk and Reimers’ small-peptide study that disulfide bridges are very sensitive to the BSIE. Therefore, we recommend optimizing cysteine dimers simultaneously in one core instead of treating them separately.

The second problem is depicted in Figure 4a, which shows an isoleucine (ILE) residue in the central core and the second QM layer around it. Because there are no p-p stacks or hydrogen bonds for this residue, the second layer merely consists of the residues directly bound to the ILE backbone. Of course, important dispersion interactions should be expected between the aliphatic side chain and its environment. The original assumption was that such interactions would be taken into account by the force field during the ONIOM treatment (the third MM layer was left

18

Figure 4. Isoleucine fragment in the original38 (a) and in the herein modified approach (b). The tubular model represents the isoleucine in the central QM core, the ball-and-stick model the middle layer that is also treated with QM. The surrounding MM layer is not shown for clarity.

out of Figure 4 for clarity). However, because we want to fully make use of the D3 method, we rather prefer to directly treat London-dispersion interactions with the nearest neighbors on the

QM level, as would be the case for the layers shown in Figure 4b.

To ensure that such interactions are taken into account on the QM level we therefore modify the original code. It still looks for covalently and hydrogen bonded residues, but moreover we introduce two different distance criteria. First, we define a radius around the central amino acid and all other residues within that radius are included in the fragment. We then scan through this fragment with a second, smaller radius. All residues that fall into this radius are included in the second layer (QM), and the residues outside that radius belong to the third (MM).

19 In preliminary calculations we tried to estimate the best radii that still allow efficient treatment of the fragments but at the same time do not miss any important interactions. We found that residues within 10Å of the central core should be included in the MM layer, while a radius of 4

Å was sufficient for our purposes for the QM layer and we adopt these radii throughout this study.

3.2 Other computational details

The fragments were generated and recombined with our own standalone code which has evolved from the original program used for the PSI optimization.38 As mentioned in the previous section, each optimization macro-cycle consists of three smaller cycles. First we carried out a few optimization steps for the even-numbered residues (except cysteines). After recombination and subsequent fragmentation, we performed the same with odd-numbered residues. Finally, we considered only cysteine dimers. At the beginning we only perform a few optimization steps for each fragment (around 10), however upon approaching convergence, we increased the number of steps.

The ONIOM131 treatments were carried out as implemented in GAUSSIAN09 Rev

C.01.132 However, in contrast to our own and Canfield et al.’s previous studies, we decided to use

Gaussian only as the geometry optimizer and to calculate the MM force-field contribution. All actual QM calculations were carried out with TURBOMOLE 6.4133-134 and linked with

GAUSSIAN through its keyword “external” and a standalone script. We made use of Grimme’s

DFT/HF-D3 correction with Becke-Johnson damping,99 which is also implemented in

TURBOMOLE. Originally, this correction had been dubbed DFT/HF-D3(BJ), however, as the

Becke-Johnson version had become default in such calculations, the suffix “BJ” has been

20 skipped in later publications and we do the same herein. Contributions to the energy and gradients from the gCP42 correction were calculated with Grimme’s standalone program.135

In this approach the exact type of force field is only of minor relevance because it is mainly needed to mimic the long-range chemical environment for the residue in the center of each fragment. For the sake of simplicity we chose the AMBER force-field136 with B3LYP76-77/6-

31G*78 atomic charges taken from the previous studies.38, 69 In preliminary calculations we also explored the possibility of electronic embedding in the ONIOM calculations, i.e. the MM charges directly influence the QM Hamiltonian. However, we observed that the results deteriorated and therefore we continued only with the standard ONIOM approach (see SI). We initially tested the following QM methods: the BLYP,74, 137 BP8674-75 and PBE138 generalized- gradient-approximation (GGA) density functionals, the TPPS139 meta-GGA density functional, the B3LYP, and PW6B95127 hybrid functionals, and finally HF. However, as described in the next section, we later reduced the number of tested methods.

We used the two double-z AO basis sets that had been recommended in Goerigk and

Reimers’ small-peptide study.126 The first basis set is Pople’s 6-31G* set with polarization functions on all atoms except on hydrogen. The second is Ahlrichs' SV79 basis set, however, as outlined in the small-peptide study, we additionally added polarization functions taken from the

SVP basis on sulfur atoms, as mere SV calculations showed a large BSIE for S-S bond lengths.126

In the rest of this paper we call this modified basis set “SVP(S)”. The calculation of the Coulomb contributions for all methods, including the hybrid functionals and HF, were accelerated with the resolution-of-the-identity (RI-J) approach and auxiliary basis functions were taken from the

TURBOMOLE library.140 For the gCP correction separate parameters have to be used for DFT and HF calculations, and also for the two different basis sets.42 For the SVP(S) basis, we used the

21 parameters determined for SV, as sulfur-containing species had not been part of the gCP training set. All DFT calculations were carried out with TURBOMOLE’s quadrature grid “m4”.141 We did not observe any effects from employing larger grids.

-7 The convergence criterion for each SCF step was set to 10 Eh. During our first calculations we noticed convergence problems in the self-consistent-field (SCF) calculations for all density functionals due to small or even slightly negative HOMO-LUMO gaps. Surprisingly, this problem had only been addressed sporadically in the literature.41, 61, 142 We follow here the recommendation to solve this problem by employing the COSMO143 continuum solvation model with a dielectric constant of 4.41 We do want to stress that this only a technical means to ensure smooth performance of the DFT calculations. HF does not show this convergence problem, which makes the self-interaction-error in DFT the likely source of the problem. Our HF calculations were therefore conducted without COSMO. However, we discuss COSMO effects on HF in the SI. The issue of convergence problems for DFT is further outlined in section 4.2.

Contrary to our previous study on HEWL, but in line with the small-peptide study by

Goerigk and Reimers, we mainly focus our analysis on root-mean square deviations (RMSDs) involving all Cartesian coordinates of the atoms in the optimized HEWL portion. All RMSDs were calculated with respect to the original 2V1BiJ ensemble. Therefore, the analysis involves all eight copies A-H. All main findings and conclusions are based on the RMSD analysis.

Additionally, we end the discussion with a short outlook on quantum-refinement procedures (see section 4.5 for more details).

22 4. Results and discussion

4.1 Snapshot after three optimization macro-cycles

We begin our discussion by taking “snapshots” of the optimization procedure after three macro-cycles to see if we can already identify genuine trends for the various methods. In each of the three cycles each investigated residue underwent ten optimization steps. Even though the structures have not fully converged yet, this already allows us to analyze the effects of adding the

D3 and gCP corrections and to preselect promising methods for which the fragment is subsequently optimized until convergence.

Inspired by the previous analysis of the D3 and gCP corrections for gas-phase peptide structures,126 we decided to analyze them herein for three representative methods: BP86, B3LYP and HF. Figure 5 shows the RMSDs of the Cartesian coordinates of all optimized atoms for the uncorrected methods and for various combinations of the applied corrections. All numbers were obtained with the 6-31G* AO basis set. For all methods we see an increase in the RMSDs when they are purely BSSE corrected. This increase is in particular larger for the two DFT methods than for HF. This means there is an influence of BSSE on the structures obtained with the pure approaches. The fact that the RMSD for the uncorrected methods is better than for the BSSE- corrected ones parallels previous findings for peptide gas-phase structures and for organic- chemical reaction and activation energies.106, 124 Similarly to those previous studies we argue that the uncorrected methods benefit from error-compensation. Cancellation of errors is likely to be one of the reasons why for instance BP86/6-31G* has become a popular method for protein structure optimization. However, as argued previously, controlling these cancellation effects can be a difficult task and provides ample room for uncertainty in the analysis of such results.

23

Figure 5. Snapshot after three optimization macro-cycles: root-mean-square deviations (RMSDs) in Å for all atoms in the optimized fragment for BP86, B3LYP and HF with and without the D3 and gCP corrections are shown. The 6-31G* AO basis set was employed in all cases.

Figure 5 also shows the effects of just adding the D3 correction and of adding both corrections simultaneously with the latter variant yielding the lowest RMSDs for all three methods. At a first glance it may seem puzzling why the combination of both corrections has a different overall effect on the RMSDs than the application of just one of them. For instance,

BP86-gCP leads to an increased RMSD while BP86-D3 has a similar RMSD compared to BP86.

One would intuitively expect BP86-D3-gCP to also show an increase, but instead one observes an overall decrease. Nevertheless, this parallels what has been found in gas-phase calculations and we would like to point out that although the corrections are additive in nature, we do not discuss mere single-point energies but structure optimizations. Each method combination provides a different potential energy surface and it is therefore not surprising that a combination of D3 and gCP leads to a different minimum in that energy landscape than when just one of the two corrections is used.

24 Even though we can detect specific trends, we also notice that the difference in the

RMSDs between the various combinations is smaller than for gas-phase peptide structures. For instance, the difference between the RMSDs for BP86-gCP and BP86-D3-gCP was reported to be 0.6 Å on average for di- and tripeptides,126 whereas it is significantly smaller for the investigated lysozyme fragment. This can be attributed to a higher structural rigidity of the protein environment. Nevertheless, the reported trends are encouraging and we will employ the

D3-gCP combination in the remainder of this study. To ensure that the trends can also be observed for a fully optimized structure, we continue to carry out optimizations with the uncorrected BP86 functional.

Finally, our snapshot allows us a first method ranking. The RMSDs for a variety of methods are shown in Table S2 in the SI. We observe that (meta-) GGA functionals show

RMSDs between 0.215 Å (BP86-D3-gCP) and 0.228 Å (BLYP-D3-gCP). They are outperformed by the hybrid functionals B3LYP-D3-gCP (0.203 Å) and PW6B95-D3-gCP (0.182

Å). HF-D3-gCP has an RMSD slightly higher than the best hybrid (0.193 Å). Table S2 also lists mean absolute deviations (MADs) for the dihedral backbone angles, however no particular trends can be observed and most MADs are around 1.8∘.

It is likely that for some methods the RMSDs can increase in subsequent optimization steps as the atoms move further away from their initial coordinates. However, based on this preliminary snapshot and the recommendations given by Goerigk and Reimers, we decided against continuing this study with all of these approaches for the sake of computational efficiency. We therefore focus mostly on methods that are either popular in the community or that have been shown to be beneficial for peptide structures, viz. BP86, BP86-D3-gCP, B3LYP-

D3-gCP, PW6B95-D3-gCP and HF-D3-gCP.

25 4.2 Convergence of the SCF and optimization procedures

Figure 6. a) Portion of fragments for which the very first SCF calculation failed when no

COSMO model is employed. b) Average steps in successful SCF calculations without and with the COSMO model (e=4). c) Portion of structurally converged residues after three optimization macro-cycles.

As mentioned in section 3.2, SCF convergence problems for DFT calculations in polypeptides have only been addressed sporadically41, 61, 142 and those reports have not had much influence on subsequent studies. Our investigation also allows us to provide more evidence for this problem.

Figure 6a shows the portion of fragments for which the SCF calculation in the very first optimization step failed when no COSMO solvation model was used. For (meta-)GGAs this was the case for about one third of all calculations. Different functionals displayed this problem for different residues and we do not perceive a general reason for the convergence failure other than small HOMO-LUMO gaps, as expected for (meta-)GGAs.142 Although it is possible to manually solve this problem with standard tricks, such as level shifting or quadratic convergence, one has to consider the large number of calculations and the fact that in each case a different solution has to be found. Overall, this makes the application of such functionals less attractive when aiming at

26 a procedure that can be carried out smoothly with a minimum of technical problems. The easiest way to assure smooth convergence behavior without much manual input is to use a continuum solvation model, such as COSMO, with a small dielectric constant of e.g. 4, as suggested by

Antony and Grimme.41 Indeed, when employing COSMO, we experienced no convergence issues for the problematic functionals.

Figure 6a shows that none of the SCF calculations failed for hybrid functionals and HF.

However, Figure 6b reveals that without COSMO the SCF convergence is on average slower for hybrids. For some residues we observed that about 200 SCF steps were necessary. On average around 40 or more SCF steps were needed, while this number was reduced to around 16 with

COSMO. HF without COSMO does not have this problem and converged smoothly in on average 16 steps (see SI for COSMO effects on HF).

SCF convergence problems have also been reported recently144 for gas-phase proteins optimized using semi-empirical HF methods34 that were also traced back to the basic nature of a protein as a low band-gap material. Moreover, it was demonstrated that application of localized- molecular-orbital (LMO62) techniques can lead to artificially over-polarized wave functions that do not converge to the full-SCF wave function. Both of these problems were shown to be overcome using implicit solvent models.144 The linear scaling method that we employ is similar to the LMO approach but circumvents convergence to incorrectly polarized states by always including strongly coupled ion pairs within in the same fragment. The good convergence that we find for HF calculations without using implicit solvent models therefore could reflect differences between ab initio and semi-empirical HF schemes but could also arise as we only optimize residues for which a 5 Å coordination shell is guaranteed, therefore demanding that explicit solvation is always present. While in the context of full protein optimization, explicit solvent can

27 sometimes be sufficient to stabilize even DFT calculations,38 implicit solvent may in general still be required for ab initio HF.

Another time-determining problem in protein optimizations is the actual geometry convergence. To get a first idea about this, we analyze the intermediate structures after three optimization macro-cycles. As shown in Figure 6c, convergence was signaled for one half of the residues with HF-D3-gCP. For DFT methods this was only the case for 15 to 21% of the residues. Although even for those 50% the final geometries may change slightly during the subsequent optimization steps, this is already a good indicator for the overall geometry convergence behavior of HF-D3-gCP. Indeed, for the structures reported in the next section, the

HF-D3-gCP optimizations were the first to finish, while for all DFT calculations we needed at least double the amount of time. Thus, we can conclude in the context of our study that technical benefits such as faster SCF and geometry convergence for HF-D3-gCP outweigh the disadvantage of dealing with the relatively costly evaluation of Fock exchange.

4.3 Fully converged structures for 6-31G*

We continue our discussion with the results for fully converged structures obtained with the methods selected in section 4.1 and the 6-31G* basis set. Figure 7 shows the RMSDs for each individual residue (numbers are shown in Table S3). Each residue is represented in different lysozyme copies within the 2x2x2 supercell and therefore may experience a selection of slightly different chemical environments. This is indicated by the fact that for some residues and methods fluctuations in the RMSDs between the different copies are observed. As shown in Figure 7, some residues, such as glycine or cysteine, show smaller RMSDs than, for instance, serine or isoleucine. Nevertheless, the same fluctuations are detected for all tested methods, which is why

28

Figure 7. Root-mean-square deviations (Å) for various methods for each of the optimized residues in the lysozyme ensemble. The calculations were carried out with the 6-31G* AO basis.

Figure 7 still allows us to identify method-specific trends. Overall, pure BP86 has higher RMSDs than the other approaches in most cases. B3LYP-D3-gCP seems to generally give better RMSDs, which are further reduced by PW6B95-D3-gCP. Except for outliers for three of the LEU-83 copies, HF-D3-gCP outperforms the DFT methods.

Next, we analyze if those trends perceived visually in Figure 7 can also be seen for overall

RMSDs shown in Table 2. First of all, we observe that RMSDs for the fully converged lysozyme fragment increase compared to preliminary results discussed in section 4.1. However, for most methods this increase is only marginal and the numbers shown in Table 2 confirm our preliminary first insights. In particular, we notice again that a combination of the D3 and gCP corrections is beneficial, as shown for the BP86 density functional. In line with our analysis ofFigure 7 we confirm that PW6B95-D3-gCP is the best density functional with an RMSD of

1.92 Å. Surprisingly, HF-D3-gCP has the same overall RMSD as B3LYP-D3-gCP (0.205 Å).

However, we attribute this mostly to the three aforementioned outliers.

29 Table 2. Statistical values for the optimized lysozyme fragment for various levels of theory.

a b c d e RMSD MADbonds MADH-bonds MDS-S-bonds Rfree

6-31G* basis BP86 0.229 0.013 0.191 0.063 16.11 BP86-D3-gCP 0.217 0.015 0.186 0.063 16.45 B3LYP-D3-gCP 0.205 0.010 0.155 0.056 15.83 PW6B95-D3GCP 0.192 0.007 0.161 0.035 15.18 HF-D3-GCP 0.205 0.013 0.153 0.017 15.22 SVP(S) basis HF-D3-GCP 0.186 0.009 0.178 0.030 15.48 aRoot-mean-square deviations for all atoms (Å). bMean absolute deviation (Å) over 517 heavy-atom bond lengths. c Mean absolute deviation (Å) over 12 selected interresidual hydrogen bonds. dMean deviation (Å) over eight S-S bond lengths. e”Free” R factor (%) after isotropic B-factor refinement.

Figure 8. Number residues that a specific method describes with the respective best or worst

RMSD. The numbers are based on 6-31G* calculations.i

i Sometimes several methods equally gave the best or worst value for a particular residue and both results were counted, hence the sums of the values depicted in the figure may exceed the number of samples, 72.

30 Figure 8 shows for how many residues a specific method provides the best and worst

RMSDs. Despite the four outliers, HF-D3-gCP provides the best RMSDs in 60 of the considered cases, followed by PW6B95-D3-gCP with 16 cases. According to this analysis B3LYP-D3-gCP performs on average, while BP86 with and without corrections gives the largest RMSDs in 35 and 37 cases. Overall, the findings for the RMSDs are in good agreement with the previous predictions based on gas-phase peptide structures. PW6B95-D3-gCP is the best of the tested

DFT methods, but HF-D3-gCP appears to be a more robust alternative.

Table 2 also shows other statistical values, which however are more difficult to interpret than the RMSDs. The MADs for all heavy-atom bond lengths in the optimized residues are very similar for all tested methods, while the BP86-based approaches give slightly larger MADs for a subset of twelve hydrogen bonds. In section 4.5 we analyze whether R factors could be used as another indicator for method accuracy, but prior to that it is worthwhile to discuss the basis-set dependence of our results.

4.4 Beneficial basis-set effects for Hartree-Fock

The gCP correction only corrects for the BSSE and all remaining effects due to the BSIE remain unchanged. Given that we finally would like to recommend methods for a timely efficient treatment of proteins with the highest possible accuracy, we have to make a compromise and take the interplay between basis set and method into account. In particular Goerigk and Reimers have demonstrated that for peptides the 6-31G* basis set gives ideal bond lengths for DFT methods, while for HF its was shown that the SV basis set should be chosen, i.e. a basis set without polarization functions.126 The only exception to this was the treatment of sulfur atoms for

31 which polarization functions were shown to be critical. The compromise made for peptides was therefore to use the SVP (S) basis set (see section 3.2).

Results for the fully optimized fragment at the HF-D3-gCP/SVP(S) level of theory are also shown in Table 2 and they confirm that this approach indeed provides better results than with 6-31G*. The final RMSD of 0.186 Å is in fact lower than that for PW6B95-D3-gCP/6-

31G*. Table S2 in the SI shows that RMSDs for DFT/SVP(S) increase, which means that the predictions based on peptide gas phase structures can be exactly reproduced for this real-life case of lysozyme. Using the SVP(S) basis has furthermore the advantage of being computationally faster than 6-31G*, which makes it particularly attractive for the treatment of large systems.

4.5 R factor analysis

An R factor is defined as the difference between the experimentally observed X-ray reflections

(Fobs and Fcalc) and those derived from the computationally refined protein model:

!!" �!"#(ℎ, �, �) − � �!"#!(ℎ, �, �) � = , (1) !!" �!"#(ℎ, �, �) where w is a scale factor and h, k, l are the Miller indices. The R factor should decrease in each refinement step. While it is not common to optimize the model by directly minimizing R, this measure remains the most common to assess the quality of the protein model; for alternatives to gauge structural quality, see e.g. Reference 145. Usually the experimental reflections are divided into two sets. The larger “working set” is used for the actual refinement and it also provides the

70 71 factor Rwork. The remaining "free" reflections are then used for cross-validation (factor Rfree).

32 A standard crystallographic macro-molecular refinement procedure does not only optimize the atomic coordinates, but also temperature factors to take into account atomic large- amplitude motions, which can yield a blurred picture of the density. Furthermore, fitting of such factors is also used to compensate for uncertainties in non-resolved conformations. Temperature factors are either refined isotropically with one parameter per atom or anisotropically with six parameters per atom. The latter procedure is only done at very high resolutions and if the number of observed reflections is high. Otherwise the problem of overfitting can occur which results in an artificial decrease of R factors due to a large parameter space instead of real structural features.

A conventional refinement procedure optimizes the parameters against the “chemical energy” of the system, arising from the empirical geometrical restraints, and an X-ray penalty function that depends on the measured reflections. In quantum refinement the chemical energy is derived from quantum-chemical calculations.1-10 Note that in this study we do not perform quantum refinement in this conventional way because our main target is to assess various methods first. Nevertheless we can use our fully optimized structures to subsequently calculate an R factor. While these R values will be different from those obtained using a quantum- refinement procedure, they can still give valuable insight into a method’s quality. We have to consider, though, that our optimizations may alter the chemical environment, including that of the nonoptimized atoms. Given that temperature factors are also affected by nonlocal changes, we therefore have to perform a temperature-factor refinement of all atoms in the ensemble.

The original 2VB1 structure contains 1,545 atoms and was refined against a working set of 175,025 reflections, while 9,365 reflections were used for cross-validation.68 The original procedure included a fit of atomic occupancies and determination of anisotropic temperature

33 factors. The 2VBi1J ensemble on the other hand contains 19,335 atoms. Occupancies were fixed to 1/8, as explained in section 2.1, but an anisotropic temperature-factor refinement is not advisable due to a low reflection-parameter ratio. Therefore we carried out isotropic refinement of temperature factors only, as also was done in our previous HEWL study in 2012.110 In fact, therein we showed how the anisotropy of the system resulting from its conformational flexibility is partially taken into account by the ensemble itself. Moreover, we do not refine the atomic coordinates, which is why the resulting ratio between number of working-set reflections and arbitrarily fitted isotropic temperature factors is 9.1, a ratio that is sufficiently high to allow convergence of the temperature-factors and to conduct meaningful refinement. All in all, the trends in our R values still give meaningful insight into each method’s quality.

146-147 We obtained Rfree values with a modified version of REFMAC 5.5 and they are

shown in Table 2 (see SI for Rwork). For the structures obtained with the 6-31G* basis, those values follow very similar trends as the overall RMSDs discussed in section 4.3, with the BP86 approaches yielding values larger than 16%, followed by B3LYP-D3-gCP with 15.8%.

PW6B95-D3-gCP and HF-D3-gCP both show values around 15.2%, thus confirming again that these two approaches produce more accurate results. Contrary to the RMSD discussion is the

finding that HF-D3-gCP/SVP(S) shows a slightly increased value of Rfree = 15.5% compared to

HF-D3-gCP/6-31G*, which however is still better than for the standard BP86/6-31G* and

B3LYP/6-31G* levels of theory.

Table S4 lists R values when only the optimized eight copies of one particular residue are used in the refinement while the remainder maintains the initial coordinates. Interestingly, in most cases, the R factors are the same as for 2VB1iJ (14.3%) or slightly below. Also, the differences between the methods are nearly vanishing in most cases. Exceptions are TYR-53 and

34 the cysteines. The latter show a pronounced increase in the R values, which correlates with the mean deviations for the S-S bond lengths (Table 2). Sulfur is richer than the other elements in the system and hence the refinement programs are more sensitive to changes in the sulfur coordinates. Therefore, the interplay between method and BSIE is very important.

An additional interesting finding is that although optimization of many of the single residues yield similar R factors, which are often even marginally lower than for the initial structure, the combined sequential optimization of all of them leads to a nonlinear increase in the

R factor, reflecting the intricate nonlocality of temperature factors. These findings give useful insight for future quantum refinement studies in which also the atomic coordinates are refined.

The aforementioned QM/MM refinement studies were mostly carried out at the BP86-6-31G* level of theory, but nevertheless they were superior to conventional refinement techniques leading to chemically much better descriptions of the systems under investigation.1-10 Our results indicate that even better results may be achieved by adding the D3 and gCP corrections to BP86 or by completely changing the refinement protocol to HF-D3-gCP. This hypothesis has to be carefully addressed in future studies.

5. Summary and conclusion

We optimized a fragment of an ensemble of eight lysozyme structures69 using periodic boundary conditions with the help of a modified divide-and-conquer approach. The lysozyme ensemble is based on highly accurate X-ray crystallographic data68 and it therefore served as a reliable benchmark. This choice of reference allows thorough assessments of quantum-chemical methods on amino-acid residues embedded in a real biochemical environment, which is a clear advantage over standard benchmark studies of small molecules in the gas phase. Our aim was to

35 assess the effects of corrections to DFT and HF methods for the London-dispersion and BSSE problems and to compare them with the popular BP86/6-31G* level of theory that is normally used in protein optimizations and quantum-refinement of protein X-ray structures.

While average changes in most bond lengths and dihedral backbone angles seem not to be ideal means to gauge method quality, we obtained useful results from RMSDs of the

Cartesian coordinates for all optimized atoms and an analysis of crystallographic R values after isotropic temperature-factor refinement. Due to the rigidity of the system, differences in RMSDs between various methods turned out to be smaller than reported by Goerigk and Reimers for gas- phase peptide structures.126 Nevertheless, we demonstrated that method rankings were exactly reproduced as predicted in that previous study. The validity of our findings is strongly underlined by the fact that both a gas-phase benchmark study with ab initio reference structures and a study based on an experimental reference in a biochemically relevant environment give very nearly identical trends.

We found that standard DFT approaches at the double-z basis-set level rely on error compensation between the lack of a proper description of London-dispersion and an overstabilization from BSSE. This is probably one of the main reasons for their frequent application in the literature. However, we agree with authors of previous studies that such error cancellation effects may be difficult to control.124, 126 In fact, we see that Grimme’s D398-99 and gCP42 corrections for dispersion and BSSE improve our statistical values and provide a more balanced description of our structure, both for DFT and HF approaches. At the 6-31G* basis-set level we observe that even when those two corrections are applied, the popular BP86 and B3LYP functionals are outperformed by the PW6B95 hybrid functional and HF. Moreover, HF-D3-gCP

36 further improves with the SV basis set for all atoms, except for sulfur for which SVP has to be used; another finding that confirms our previous recommendations (see Table 1). This level of theory is especially useful to save computer time.

At a first glance, the computational cost related with the evaluation of two-electron exchange integrals would be an argument against HF-D3-gCP. However, we also demonstrated that DFT methods (also at the hybrid level) show either complete failure of the self-consistent- field (SCF) step or a generally slow SCF convergence, thus confirming the scarce number of previous studies discussing this problem for polypeptides and proteins. 41, 61, 142 While employing a continuum solvation model such as COSMO143 with a small dielectric constant is a solution,

HF theory, applied to at least well-characterized regions of the protein structure, does not suffer from this issue and therefore offers an intriguing way to routine applications with a minimum of technical problems. Moreover, HF-D3-gCP optimizations converged significantly faster than their DFT counterparts. Also, resolution-of-the-identity140, 148 or chain-of-sphere149 techniques offer a significant speed-up in the exchange-integral calculation in many QM codes. Therefore, no technical aspects should speak against HF-D3-gCP.

We expect our findings to be general and independent of the specific optimization procedure. On the contrary, the fact that the D3 dispersion correction with Becke-Johnson damping gives accurate structures for HF parallels previous studies on organic-chemical systems.99, 114 Moreover, we demonstrated the additional benefit of employing the gCP correction for intramolecular BSSE for a biologically relevant molecule. The D3 method has been implemented into most major quantum-chemistry codes, but also usage of gCP is straightforward with Grimme’s standalone program135 or Neese’s ORCA 3.0 program

37 package.150 In light of our results we encourage developers to also implement gCP into their codes.

In summary, we strongly recommend usage of HF-D3-gCP for optimizing polypeptides and proteins and for quantum refinement. We hope that our findings are of interest to a large community of QM and QM/MM users and influential for future applications.

Supporting Information. Optimized structures in pdb and Hyperchem format compiled in a zip archive. The additional pdf file contains tables with more information on optimized residues,

RMSDs and R values, and also a short discussion on effects of point charges in the ONIOM treatment and of COSMO on HF-D3-gCP. This material is available free of charge via the

Internet at http://pubs.acs.org.

Corresponding Authors

*E-mail: [email protected]; [email protected]

Phone: +61 3 9035 7419 (Lars Goerigk); +86 15618155341 (Jeffrey R. Reimers)

Acknowledgment

L. G. acknowledges funding from the German Academy of Sciences “Leopoldina” until March

2014 (grant ID: LPDS 2011-11). As of April 2014, L. G. is the recipient of an Australian

Research Council (ARC) Discovery Early Career Researcher Award (project number:

DE140100550), and funding from the Selby Scientific Foundation 2014 Selby Research Award is also acknowledged. Finally, we are also grateful for generous allocation of computing time from the National Computational Infrastructure (NCI) National Facility under the National

Computational Merit Allocation Scheme and from Intersect Australia Ltd (projects d63, fk5, and r88).

38

References

1. Ryde, U. Accurate Metal-Site Structures in Proteins Obtained by Combining Experimental Data and Quantum Chemistry. Dalton Trans. 2007, 607-625. 2. Ryde, U.; Nilsson, K. Quantum Refinement - a Combination of Quantum Chemistry and Protein Crystallography. J. Mol. Struc.-THEOCHEM 2003, 632, 259-275. 3. Genheden, S.; Ryde, U. A Comparison of Different Initialization Protocols to Obtain Statistically Independent Molecular Dynamics Simulations. J. Comput. Chem. 2011, 32, 187- 195. 4. Ryde, U.; Nilsson, K. Quantum Chemistry Can Locally Improve Protein Crystal Structures. J. Am. Chem. Soc. 2003, 125, 14232-14233. 5. Nilsson, K.; Hersleth, H. P.; Rod, T. H.; Andersson, K. K.; Ryde, U. The Protonation Status of Compound II in Myoglobin, Studied by a Combination of Experimental Data and Quantum Chemical Calculations: Quantum Refinement. Biophys. J. 2004, 87, 3437-3447. 6. Nilsson, K.; Lecerof, D.; Sigfridsson, E.; Ryde, U. An Automatic Method to Generate Force-Field Parameters for Hetero-Compounds. Acta Crystallogr. Sect. D. Biol. Crystallogr. 2003, 59, 274-289. 7. Rulisek, L.; Ryde, U. Structure of Reduced and Oxidized Manganese Superoxide Dismutase: A Combined Computational and Experimental Approach. J. Phys. Chem. B 2006, 110, 11511-11518. 8. Ryde, U.; Olsen, L.; Nilsson, K. Quantum Chemical Geometry Optimizations in Proteins Using Crystallographic Raw Data. J. Comput. Chem. 2002, 23, 1058-1070. 9. Hsiao, Y.-W.; Sanchez-Garcia, E.; Doerr, M.; Thiel, W. Quantum Refinement of Protein Structures: Implementation and Application to the Red Fluorescent Protein Dsred.M1. J. Phys. Chem. B 2010, 114, 15413-15423. 10. Hsiao, Y.-W.; Thiel, W. Pb(2) Intermediate of the Photoactive Yellow Protein: Structure and Excitation Energies. J. Phys. Chem. B 2011, 115, 2097-2106. 11. Karasulu, B.; Patil, M.; Thiel, W. Amine Oxidation Mediated by Lysine-Specific Demethylase 1: Quantum Mechanics/Molecular Mechanics Insights into Mechanism and Role of Lysine 661. J. Am. Chem. Soc. 2013, 135, 13400-13413. 12. Sumner, S.; Söderhjelm, P.; Ryde, U. Effect of Geometry Optimizations on QM-Cluster and QM/MM Studies of Reaction Energies in Proteins. J. Chem. Theory Comput. 2013, 9, 4205- 4214. 13. Retegan, M.; Neese, F.; Pantazis, D. A. Convergence of QM/MM and Cluster Models for the Spectroscopic Properties of the Oxygen-Evolving Complex in Photosystem II. J. Chem. Theory Comput. 2013, 9, 3832-3842. 14. Liao, R.-Z.; Thiel, W. Determinants of Regioselectivity and Chemoselectivity in Fosfomycin Resistance Protein Fosa from QM/MM Calculations. J. Phys. Chem. B 2013, 117, 1326-1336. 15. Gomez, H.; Rojas, R.; Patel, D.; Tabak, L. A.; Lluch, J. M.; Masgrau, L. A Computational and Experimental Study of O-Glycosylation. by Human UDP-GalNAc Polypeptide:GalNAc Transferase-T2. Org. Biomol. Chem. 2014, 12, 2645-2655.

39 16. Jaña, G. A.; Delgado, E. J.; Medina, F. E. How Important Is the Synclinal Conformation of Sulfonylureas to Explain the Inhibition of AHAS: A Theoretical Study. J. Chem. Inf. Model. 2014, 54, 926-932. 17. Campomanes, P.; Neri, M.; Horta, B. A. C.; Röhrig, U. F.; Vanni, S.; Tavernelli, I.; Rothlisberger, U. Origin of the Spectral Shifts among the Early Intermediates of the Rhodopsin Photocycle. J. Am. Chem. Soc. 2014, 136, 3842-3851. 18. Geronimo, I.; Paneth, P. A DFT and ONIOM Study of C-H Hydroxylation Catalyzed by Nitrobenzene 1,2-Dioxygenase. Phys. Chem. Chem. Phys. 2014, 16, 13889-13899. 19. Hou, C.; Liu, Y.-J.; Ferré, N.; Fang, W.-H. Understanding Bacterial Bioluminescence: A Theoretical Study of the Entire Process, from Reduced Flavin to Light Emission. Chem. Eur. J. 2014, 20, 7979–7986. 20. Aleksandrov, A.; Zvereva, E.; Field, M. The Mechanism of Citryl-Coenzyme A Formation Catalyzed by Citrate Synthase. J. Phys. Chem. B 2014, 118, 4505-4513. 21. Hargis, J. C.; White, J. K.; Chen, Y.; Woodcock, H. L. Can Molecular Dynamics and QM/MM Solve the Penicillin Binding Protein Protonation Puzzle? J. Chem. Inf. Model. 2014, 54, 1412–1424. 22. Daniels, A. D.; Campeotto, I.; van der Kamp, M. W.; Bolt, A. H.; Trinh, C. H.; Phillips, S. E. V.; Pearson, A. R.; Nelson, A.; Mulholland, A. J.; Berry, A. Reaction Mechanism of N- Acetylneuraminic Acid Revealed by a Combination of Crystallography, QM/MM Simulation, and Mutagenesis. ACS Chem. Biol. 2014, 9, 1025-1032. 23. Thellamurege, N. M.; Hirao, H. Effect of Protein Environment within Cytochrome P450cam Evaluated Using a Polarizable-Embedding QM/MM Method. J. Phys. Chem. B 2014, 118, 2084-2092. 24. Hargis, J. C.; Vankayala, S. L.; White, J. K.; Woodcock, H. L. Identification and Characterization of Noncovalent Interactions That Drive Binding and Specificity in DD- Peptidases and Β-Lactamases. J. Chem. Theory Comput. 2014, 10, 855-864. 25. Silva-Junior, M. R.; Mansurova, M.; Gärtner, W.; Thiel, W. Photophysics of Structurally Modified Flavin Derivatives in the Blue-Light Photoreceptor YtvA: A Combined Experimental and Theoretical Study. Chem. Bio. Chem. 2013, 14, 1648-1661. 26. Harris, T. V.; Morokuma, K. QM/MM Structural and Spectroscopic Analysis of the Di- Iron(II) and Di-Iron(III) Ferroxidase Site in M Ferritin. Inorg. Chem. 2013, 52, 8551-8563. 27. Liao, R.-Z.; Thiel, W. Convergence in the QM-Only and QM/MM Modeling of Enzymatic Reactions: A Case Study for Acetylene Hydratase. J. Comput. Chem. 2013, 34, 2389- 2397. 28. He, X.; Wang, B.; Merz, K. M. Protein NMR Chemical Shift Calculations Based on the Automated Fragmentation QM/MM Approach. J. Phys. Chem. B 2009, 113, 10380-10388. 29. Ucisik, M. N.; Dashti, D. S.; Faver, J. C.; Merz, K. M. Pairwise Additivity of Energy Components in Protein-Ligand Binding: The HIV II Protease-Indinavir Case. J. Chem. Phys. 2011, 135, 085101. 30. Ionescu, C.-M.; Geidl, S.; Svobodová Vařeková, R.; Koča, J. Rapid Calculation of Accurate Atomic Charges for Proteins via the Electronegativity Equalization Method. J. Chem. Inf. Model. 2013, 53, 2548-2558. 31. Lyne, P. D.; Hodoscek, M.; Karplus, M. A Hybrid QM−MM Potential Employing Hartree−Fock or Density Functional Methods in the Quantum Region. J. Phys. Chem. A 1999, 103, 3462-3471.

40 32. Burger, S. K.; Schofield, J.; Ayers, P. W. Quantum Mechanics/Molecular Mechanics Restrained Electrostatic Potential Fitting. J. Phys. Chem. B 2013, 117, 14960-14966. 33. Shi, T.; Han, Y.; Li, W.; Zhao, Y.; Liu, Y.; Huang, Z.; Lu, S.; Zhang, J. Exploring the Desumoylation Process of SENP1: A Study Combined MD Simulations with QM/MM Calculations on SENP1-SUMO1-RanGAP1. J. Chem. Inf. Model. 2013, 53, 2360-2368. 34. Clark, T.; Stewart, J. J. P., MNDO-Like Semiempirical Molecular Orbital Theory and Its Application to Large Systems. In Computational Methods for Large Systems: Electronic Structure Approaches for Biotechnology and Nanotechnology, Reimers , J. R., Ed. Wiley: Chichester, 2011; pp 259-286. 35. Dixon, S. L.; Merz, K. M. Semiempirical Molecular Orbital Calculations with Linear System Size Scaling. J. Chem. Phys. 1996, 104, 6643-6649. 36. Dixon, S. L.; Merz, K. M. Fast, Accurate Semiempirical Molecular Orbital Calculations for Macromolecules. J. Chem. Phys. 1997, 107, 879-893. 37. Stewart, J. J. P. Application of the PM6 Method to Modeling Proteins. J. Mol. Model. 2009, 15, 765-805. 38. Canfield, P.; Dahlbom, M. G.; Hush, N. S.; Reimers, J. R. Density-Functional Geometry Optimization of the 150 000-Atom Photosystem-I Trimer. J. Chem. Phys. 2006, 124, 024301. 39. Yin, S.; Dahlbom, M. G.; Canfield, P. J.; Hush, N. S.; Kobayashi, R.; Reimers, J. R. Assignment of the Q(Y) Absorption Spectrum of Photosystem-I from Thermosynechococcus Elongatus Based on CAM-B3LYP Calculations at the PW91-Optimized Protein Structure. J. Phys. Chem. B 2007, 111, 9923-9930. 40. König, C.; Neugebauer, J. Protein Effects on the Optical Spectrum of the Fenna- Matthews-Olson Complex from Fully Quantum Chemical Calculations. J. Chem. Theory Comput. 2013, 9, 1808-1820. 41. Antony, J.; Grimme, S. Fully Ab Initio Protein-Ligand Interaction Energies with Dispersion Corrected Density Functional Theory. J. Comput. Chem. 2012, 33, 1730-1739. 42. Kruse, H.; Grimme, S. A Geometrical Correction for the Inter- and Intra-Molecular Basis Set Superposition Error in Hartree-Fock and Density Functional Theory Calculations for Large Systems. J. Chem. Phys. 2012, 136, 154101. 43. Sure, R.; Grimme, S. Corrected Small Basis Set Hartree-Fock Method for Large Systems. J. Comput. Chem. 2013, 34, 1672-1685. 44. Fedorov, D. G.; Alexeev, Y.; Kitaura, K. Geometry Optimization of the Active Site of a Large System with the Fragment Molecular Orbital Method. J. Phys. Chem. Lett. 2011, 2, 282- 288. 45. Fedorov, D. G.; Kitaura, K. Extending the Power of Quantum Chemistry to Large Systems with the Fragment Molecular Orbital Method. J. Phys. Chem. A 2007, 111, 6904-6914. 46. Fedorov, D. G.; Ishida, T.; Uebayasi, M.; Kitaura, K. The Fragment Molecular Orbital Method for Geometry Optimizations of Polypeptides and Proteins. J. Phys. Chem. A 2007, 111, 2722-2732. 47. Nagata, T.; Brorsen, K.; Fedorov, D. G.; Kitaura, K.; Gordon, M. S. Fully Analytic Energy Gradient in the Fragment Molecular Orbital Method. J. Chem. Phys. 2011, 134, 124115. 48. Nagata, T.; Fedorov, D. G.; Sawada, T.; Kitaura, K.; Gordon, M. S. A Combined Effective Fragment Potential-Fragment Molecular Orbital Method. II. Analytic Gradient and Application to the Geometry Optimization of Solvated Tetraglycine and Chignolin. J. Chem. Phys. 2011, 134, 034110.

41 49. Pruitt, S. R.; Steinmann, C.; Jensen, J. H.; Gordon, M. S. Fully Integrated Effective Fragment Molecular Orbital Method. J. Chem. Theory Comput. 2013, 9, 2235-2249. 50. Lee, T. S.; Lewis, J. P.; Yang, W. T. Linear-Scaling Quantum Mechanical Calculations of Biological Molecules: The Divide-and-Conquer Approach. Comput. Mater. Sci. 1998, 12, 259- 277. 51. Yang, W.; Lee, T. S. A Density‐Matrix Divide‐and‐Conquer Approach for Electronic Structure Calculations of Large Molecules. J. Chem. Phys. 1995, 103, 5674-5678. 52. Gogonea, V.; Westerhoff, L. M.; Merz, K. M. Quantum Mechanical/Quantum Mechanical Methods. I. A Divide and Conquer Strategy for Solving the Schrödinger Equation for Large Molecular Systems Using a Composite Density Functional–Semiempirical Hamiltonian. J. Chem. Phys. 2000, 113, 5604-5613. 53. Cankurtaran, B. O.; Gale, J. D.; Ford, M. J. First Principles Calculations Using Density Matrix Divide-and-Conquer within the Siesta Methodology. J. Phys.: Condens. Matter 2008, 20, 294208. 54. He, X.; Merz, K. M. Divide and Conquer Hartree-Fock Calculations on Proteins. J. Chem. Theory Comput. 2010, 6, 405-411. 55. White, C. A.; Johnson, B. G.; Gill, P. M. W.; Head-Gordon, M. Linear Scaling Density Functional Calculations via the Continuous Fast Multipole Method. Chem. Phys. Lett. 1996, 253, 268-278. 56. Wada, M.; Sakurai, M. A Quantum Chemical Method for Rapid Optimization of Protein Structures. J. Comput. Chem. 2005, 26, 160-168. 57. Mayhall, N. J.; Raghavachari, K. Molecules-in-Molecules: An Extrapolated Fragment- Based Approach for Accurate Calculations on Large Molecules and Materials. J. Chem. Theory Comput. 2011, 7, 1336-1343. 58. Kussmann, J.; Beer, M.; Ochsenfeld, C. Linear-Scaling Self-Consistent Field Methods for Large Molecules. Wiley Interdiscip. Rev.: Comput. Mol. Sci. 2013, 3, 614-636. 59. Exner, T. E.; Mezey, P. G. Ab Initio-Quality Electrostatic Potentials for Proteins: An Application of the ADMA Approach. J. Phys. Chem. A 2002, 106, 11791-11800. 60. He, X.; Zhang, J. Z. H. A New Method for Direct Calculation of Total Energy of Protein. J. Chem. Phys. 2005, 122, 031103. 61. Rudberg, E.; Rubensson, E. H.; Sałek, P. Kohn−Sham Density Functional Theory Electronic Structure Calculations with Linearly Scaling Computational Time and Memory Usage. J. Chem. Theory Comput. 2010, 7, 340-350. 62. Stewart, J. J. P. Application of Localized Molecular Orbitals to the Solution of Semiempirical Self-Consistent Field Equations. Int. J. Quantum Chem 1996, 58, 133-146. 63. Kulik, H. J.; Luehr, N.; Ufimtsev, I. S.; Martinez, T. J. Ab Initio Quantum Chemistry for Protein Structures. J. Phys. Chem. B 2012, 116, 12501-12509. 64. Miao, Y.; Merz, K. M. Acceleration of Electron Repulsion Integral Evaluation on Graphics Processing Units via Use of Recurrence Relations. J. Chem. Theory Comput. 2013, 9, 965-976. 65. Andrade, X.; Aspuru-Guzik, A. Real-Space Density Functional Theory on Graphical Processing Units: Computational Approach and Comparison to Gaussian Basis Set Methods. J. Chem. Theory Comput. 2013, 9, 4360-4373. 66. Steinbrecher, T.; Elstner, M., QM and QM/MM Simulations of Proteins. In Biomolecular Simulations, Monticelli, L.; Salonen, E., Eds. Humana Press: 2013; Vol. 924, pp 91-124.

42 67. Gindulyte, A.; Bashan, A.; Agmon, I.; Massa, L.; Yonath, A.; Karel, J. The Transition State for Formation of the in the Ribosome. Proc. Natl. Acad. Sci. U.S.A. 2006, 103, 13327-13332. 68. Wang, J.; Dauter, M.; Alkire, R.; Joachimiak, A.; Dauter, Z. Triclinic Lysozyme at 0.65 Å Resolution. Acta Crystallogr. Sect. D. Biol. Crystallogr. 2007, 63, 1254-1268. 69. Falklöf, O.; Collyer, C. A.; Reimers, J. R. Toward Ab Initio Refinement of Protein X-Ray Crystal Structures: Interpreting and Correlating Structural Fluctuations. Theor. Chem. Acc. 2012, 131, 1076. 70. Rupp, B., Biomolecular Crystallography: Principles, Practice, and Application to Structural Biology. Garland Science: New York, 2010. 71. Brunger, A. T. Free R Value: A Novel Statistical Quantity for Assessing the Accuracy of Crystal Structures. Nature 1992, 355, 472-475. 72. Hohenberg, P.; Kohn, W. Inhomogeneous Electron Gas. Phys. Rev. B 1964, 136, B864. 73. Kohn, W.; Sham, L. J. Self-Consistent Equations Including Exchange and Correlation Effects. Phys. Rev. 1965, 140, A1133-A1138. 74. Becke, A. D. Density-Functional Exchange-Energy Approximation with Correct Asymptotic-Behavior. Phys. Rev. A 1988, 38, 3098-3100. 75. Perdew, J. P.; Yue, W. Accurate and Simple Density Functional for the Electronic Exchange Energy - Generalized Gradient Approximation. Phys. Rev. B 1986, 33, 8800-8802. 76. Becke, A. D. Density-Functional Thermochemistry .3. The Role of Exact Exchange. J. Chem. Phys. 1993, 98, 5648-5652. 77. Stephens, P. J.; Devlin, F. J.; Chabalowski, C. F.; Frisch, M. J. Ab-Initio Calculation of Vibrational Absorption and Circular-Dichroism Spectra Using Density-Functional Force-Fields. J. Phys. Chem. 1994, 98, 11623-11627. 78. Hehre, W. J.; Ditchfield, R.; Pople, J. A. Self-Consistent Molecular-Orbital Methods .12. Further Extensions of Gaussian-Type Basis Sets for Use in Molecular-Orbital Studies of Organic-Molecules. J. Chem. Phys. 1972, 56, 2257-&. 79. Schäfer, A.; Horn, H.; Ahlrichs, R. Fully Optimized Contracted Gaussian-Basis Sets for Atoms Li to Kr. J. Chem. Phys. 1992, 97, 2571-2577. 80. Weigend, F.; Ahlrichs, R. Balanced Basis Sets of Split Valence, Triple Zeta Valence and Quadruple Zeta Valence Quality for H to Rn: Design and Assessment of Accuracy. Phys. Chem. Chem. Phys. 2005, 7, 3297-3305. 81. Perez-Jorda, J. M.; Becke, A. D. A Density-Functional Study of Van-Der-Waals Forces - Rare-Gas Diatomics. Chem. Phys. Lett. 1995, 233, 134-137. 82. Kristyan, S.; Pulay, P. Can (Semi)Local Density-Functional Theory Account for the London Dispersion Forces. Chem. Phys. Lett. 1994, 229, 175-180. 83. Hobza, P.; Sponer, J.; Reschel, T. Density-Functional Theory and Molecular Clusters. J. Comput. Chem. 1995, 16, 1315-1325. 84. Sponer, J.; Leszczynski, J.; Hobza, P. Base Stacking in Cytosine Dimer. A Comparison of Correlated Ab Initio Calculations with Three Empirical Potential Models and Density Functional Theory Calculations. J. Comput. Chem. 1996, 17, 841-850. 85. Dion, M.; Rydberg, H.; Schroder, E.; Langreth, D. C.; Lundqvist, B. I. Van Der Waals Density Functional for General Geometries. Phys. Rev. Lett. 2004, 92, 246401. 86. Vydrov, O. A.; Van Voorhis, T. Nonlocal Van Der Waals Density Functional Made Simple. Phys. Rev. Lett. 2009, 103, 063004.

43 87. Lee, K.; Murray, E. D.; Kong, L.; Lundqvist, B. I.; Langreth, D. C. Higher-Accuracy Van Der Waals Density Functional. Phys. Rev. B 2010, 82, 081101. 88. Vydrov, O. A.; Van Voorhis, T. Nonlocal Van Der Waals Density Functional: The Simpler the Better. J. Chem. Phys. 2010, 133, 244103. 89. Hujo, W.; Grimme, S. Performance of the Van Der Waals Density Functional VV10 and (Hybrid)GGA Variants for Thermochemistry and Noncovalent Interactions. J. Chem. Theory Comput. 2011, 7, 3866-3871. 90. Grimme, S. Accurate Description of Van Der Waals Complexes by Density Functional Theory Including Empirical Corrections. J. Comput. Chem. 2004, 25, 1463-1473. 91. Grimme, S. Semiempirical GGA-Type Density Functional Constructed with a Long- Range Dispersion Correction. J. Comput. Chem. 2006, 27, 1787-1799. 92. von Lilienfeld, O. A.; Tavernelli, I.; Rothlisberger, U.; Sebastiani, D. Optimization of Effective Atom Centered Potentials for London Dispersion Forces in Density Functional Theory. Phys. Rev. Lett. 2004, 93, 153004. 93. Zhao, Y.; Truhlar, D. G. The M06 Suite of Density Functionals for Main Group Thermochemistry, Thermochemical Kinetics, Noncovalent Interactions, Excited States, and Transition Elements: Two New Functionals and Systematic Testing of Four M06-Class Functionals and 12 Other Functionals. Theor. Chem. Acc. 2008, 120, 215-241. 94. Becke, A. D.; Johnson, E. R. Exchange-Hole Dipole Moment and the Dispersion Interaction. J. Chem. Phys. 2005, 122, 154104. 95. Steinmann, S. N.; Corminboeuf, C. Comprehensive Benchmarking of a Density- Dependent Dispersion Correction. J. Chem. Theory Comput. 2011, 7, 3567-3577. 96. Tkatchenko, A.; Scheffler, M. Accurate Molecular Van Der Waals Interactions from Ground-State Electron Density and Free-Atom Reference Data. Phys. Rev. Lett. 2009, 102, 073005. 97. Torres, E.; DiLabio, G. A. A (Nearly) Universally Applicable Method for Modeling Noncovalent Interactions Using B3LYP. J. Phys. Chem. Lett. 2012, 3, 1738-1744. 98. Grimme, S.; Antony, J.; Ehrlich, S.; Krieg, H. A Consistent and Accurate Ab Initio Parametrization of Density Functional Dispersion Correction (DFT-D) for the 94 Elements H-Pu. J. Chem. Phys. 2010, 132, 154104. 99. Grimme, S.; Ehrlich, S.; Goerigk, L. Effect of the Damping Function in Dispersion Corrected Density Functional Theory. J. Comput. Chem. 2011, 32, 1456-1465. 100. Grimme, S. Density Functional Theory with London Dispersion Corrections. Wiley Interdiscip. Rev.: Comput. Mol. Sci. 2011, 1, 211-228. 101. Klimes, J.; Michaelides, A. Perspective: Advances and Challenges in Treating Van Der Waals Dispersion Forces in Density Functional Theory. J. Chem. Phys. 2012, 137, 120901. 102. Goerigk, L. How Do DFT-DCP, DFT-NL, and DFT-D3 Compare for the Description of London-Dispersion Effects in Conformers and General Thermochemistry? J. Chem. Theory Comput. 2014, 10, 968-980. 103. Goerigk, L.; Kruse, H.; Grimme, S. Benchmarking Density Functional Methods against the S66 and S66x8 Datasets for Non-Covalent Interactions. Chem. Phys. Chem. 2011, 12, 3421- 3433. 104. Ehrlich, S.; Moellmann, J.; Grimme, S. Dispersion-Corrected Density Functional Theory for Aromatic Interactions in Complex Systems. Acc. Chem. Res. 2013, 46, 916-926.

44 105. Risthaus, T.; Grimme, S. Benchmarking of London Dispersion-Accounting Density Functional Theory Methods on Very Large Molecular Complexes. J. Chem. Theory Comput. 2013, 9, 1580-1591. 106. Goerigk, L.; Karton, A.; Martin, J. M. L.; Radom, L. Accurate Quantum Chemical Energies for Tetrapeptide Conformations: Why MP2 Data with an Insufficient Basis Set Should Be Handled with Caution. Phys. Chem. Chem. Phys. 2013, 15, 7028-7031. 107. Grimme, S.; Steinmetz, M. Effects of London Dispersion Correction in Density Functional Theory on the Structures of Organic Molecules in the Gas Phase. Phys. Chem. Chem. Phys. 2013, 15, 16031-16042. 108. Goerigk, L.; Grimme, S. A Thorough Benchmark of Density Functional Methods for General Main Group Thermochemistry, Kinetics, and Noncovalent Interactions. Phys. Chem. Chem. Phys. 2011, 13, 6670-6688. 109. Grimme, S.; Huenerbein, R.; Ehrlich, S. On the Importance of the Dispersion Energy for the Thermodynamic Stability of Molecules. Chem. Phys. Chem. 2011, 12, 1258-1261. 110. Goerigk, L.; Falklöf, O.; Collyer, C. A.; Reimers , J. R., First Steps Towards Quantum Refinement of Protein X-Ray Structures. In Quantum Simulations of Materials and Biological Systems, Zeng, J.; Zhang, R. Q.; Treutlein, H. R., Eds. Springer: Dordrecht, 2012; pp 87-120. 111. Hepburn, J.; Scoles, G.; Penco, R. A Simple but Reliable Method for the Prediction of Intermolecular Potentials. Chem. Phys. Lett. 1975, 36, 451-456. 112. Ahlrichs, R. Intermolecular Forces in Simple Systems. Chem. Phys. 1977, 19, 119-130. 113. Roskop, L.; Fedorov, D. G.; Gordon, M. S. Diffusion Energy Profiles in Silica Mesoporous Molecular Sieves Modelled with the Fragment Molecular Orbital Method. Mol. Phys. 2013, 111, 1622-1629. 114. Brandenburg, J.; Grimme, S. Dispersion Corrected Hartree–Fock and Density Functional Theory for Organic Crystal Structure Prediction. Top. Curr. Chem. 2014, 345, 1-23. 115. Boys, S. F.; Bernardi, F. Calculation of Small Molecular Interactions by Differences of Separate Total Energies - Some Procedures with Reduced Errors. Mol. Phys. 1970, 19, 553-566. 116. Mentel, Ł. M.; Baerends, E. J. Can the Counterpoise Correction for Basis Set Superposition Effect Be Justified? J. Chem. Theory Comput. 2013, 10, 252-267. 117. Valdes, H.; Klusak, V.; Pitonak, M.; Exner, O.; Stary, I.; Hobza, P.; Rulisek, L. Evaluation of the Intramolecular Basis Set Superposition Error in the Calculations of Larger Molecules: n Helicenes and Phe-Gly-Phe Tripeptide. J. Comput. Chem. 2008, 29, 861-870. 118. Moran, D.; Simmonett, A. C.; Leach, F. E.; Allen, W. D.; Schleyer, P. v. R.; Schaefer, H. F. Popular Theoretical Methods Predict Benzene and Arenes to Be Nonplanar. J. Am. Chem. Soc. 2006, 128, 9342-9343. 119. Toroz, D.; Van Mourik, T. The Structure of the Gas-Phase Tyrosine-Glycine Dipeptide. Mol. Phys. 2006, 104, 559-570. 120. van Mourik, T.; Karamertzanis, P. G.; Price, S. L. Molecular Conformations and Relative Stabilities Can Be as Demanding of the Electronic Structure Method as Intermolecular Calculations. J. Phys. Chem. A 2006, 110, 8-12. 121. Holroyd, L. F.; van Mourik, T. Insufficient Description of Dispersion in B3LYP and Large Basis Set Superposition Errors in MP2 Calculations Can Hide Peptide Conformers. Chem. Phys. Lett. 2007, 442, 42-46. 122. Asturiol, D.; Duran, M.; Salvador, P. Intramolecular Basis Set Superposition Error Effects on the Planarity of Benzene and Other Aromatic Molecules: A Solution to the Problem. J. Chem. Phys. 2008, 128, 144108.

45 123. Jensen, F. An Atomic Counterpoise Method for Estimating Inter- and Intramolecular Basis Set Superposition Errors. J. Chem. Theory Comput. 2010, 6, 100-106. 124. Kruse, H.; Goerigk, L.; Grimme, S. Why the Standard B3LYP/6-31G* Model Chemistry Should Not Be Used in DFT Calculations of Molecular Thermochemistry: Understanding and Correcting the Problem. J. Org. Chem. 2012, 77, 10824-10834. 125. Brandenburg, J. G.; Alessio, M.; Civalleri, B.; Peintinger, M. F.; Bredow, T.; Grimme, S. Geometrical Correction for the Inter- and Intramolecular Basis Set Superposition Error in Periodic Density Functional Theory Calculations. J. Phys. Chem. A 2013, 117, 9282-9292. 126. Goerigk, L.; Reimers, J. R. Efficient Methods for the Quantum Chemical Treatment of Protein Structures: The Effects of London-Dispersion and Basis-Set Incompleteness on Peptide and Water-Cluster Geometries. J. Chem. Theory Comput. 2013, 9, 3240-3251. 127. Zhao, Y.; Truhlar, D. G. Design of Density Functionals That Are Broadly Accurate for Thermochemistry, Thermochemical Kinetics, and Nonbonded Interactions. J. Phys. Chem. A 2005, 109, 5656-5667. 128. Fleming, A. On a Remarkable Bacteriolytic Element Found in Tissues and Secretions. Proc. R. Soc. London B Bio. 1922, 93, 306-317. 129. Blake, C. C. F.; Poljak, R. J.; Fenn, R. H.; North, A. C. T.; Phillips, D. C. A Fourier Map of Electron Density at 6 Å Resolution Obtained by X-Ray Diffraction. Nature 1962, 196, 1173- 1176. 130. Perdew, J. P.; Chevary, J. A.; Vosko, S. H.; Jackson, K. A.; Pederson, M. R.; Singh, D. J.; Fiolhais, C. Atoms, Molecules, Solids, and Surfaces - Applications of the Generalized Gradient Approximation for Exchange and Correlation. Phys. Rev. B 1992, 46, 6671-6687. 131. Dapprich, S.; Komaromi, I.; Byun, K. S.; Morokuma, K.; Frisch, M. J. A New ONIOM Implementation in Gaussian98. Part I. The Calculation of Energies, Gradients, Vibrational Frequencies and Electric Field Derivatives. J. Mol. Struc.-THEOCHEM 1999, 461, 1-21. 132. M. J. Frisch; G. W. Trucks; H. B. Schlegel; G. E. Scuseria; M. A. Robb; J. R. Cheeseman; G. Scalmani; V. Barone; B. Mennucci; G. A. Petersson et al., Gaussian 09, Revision C.01, Gaussian, Inc.: Wallingford CT, 2010. 133. Ahlrichs, R.; Bar, M.; Haser, M.; Horn, H.; Kolmel, C. Electronic-Structure Calculations on Workstation Computers - the Program System Turbomole. Chem. Phys. Lett. 1989, 162, 165- 169. 134. Furche, F.; Ahlrichs, R.; Hattig, C.; Klopper, W.; Sierka, M.; Weigend, F. Turbomole. Wiley Interdiscip. Rev.: Comput. Mol. Sci. 2014, 4, 91-100. 135. S. Grimme' S Group Website. http://www.thch.uni-bonn.de/tc/downloads/gcp/index.html (accessed 1 October 2014). 136. Cornell, W. D.; Cieplak, P.; Bayly, C. I.; Gould, I. R.; Merz, K. M.; Ferguson, D. M.; Spellmeyer, D. C.; Fox, T.; Caldwell, J. W.; Kollman, P. A. A 2nd Generation Force-Field for the Simulation of Proteins, Nucleic-Acids, and Organic-Molecules. J. Am. Chem. Soc. 1995, 117, 5179-5197. 137. Lee, C. T.; Yang, W. T.; Parr, R. G. Development of the Colle-Salvetti Correlation- Energy Formula into a Functional of the Electron-Density. Phys. Rev. B 1988, 37, 785-789. 138. Perdew, J. P.; Burke, K.; Ernzerhof, M. Generalized Gradient Approximation Made Simple. Phys. Rev. Lett. 1996, 77, 3865-3868. 139. Tao, J. M.; Perdew, J. P.; Staroverov, V. N.; Scuseria, G. E. Climbing the Density Functional Ladder: Nonempirical Meta-Generalized Gradient Approximation Designed for Molecules and Solids. Phys. Rev. Lett. 2003, 91, 146401.

46 140. Eichkorn, K.; Treutler, O.; Ohm, H.; Häser, M.; Ahlrichs, R. Auxiliary Basis-Sets to Approximate Coulomb Potentials. Chem. Phys. Lett. 1995, 240, 283-289. 141. Eichkorn, K.; Weigend, F.; Treutler, O.; Ahlrichs, R. Auxiliary Basis Sets for Main Row Atoms and Transition Metals and Their Use to Approximate Coulomb Potentials. Theor. Chem. Acc. 1997, 97, 119-124. 142. Rudberg, E. Difficulties in Applying Pure Kohn–Sham Density Functional Theory Electronic Structure Methods to Protein Molecules. J. Phys.: Condens. Matter 2012, 24, 072202. 143. Klamt, A.; Schuurmann, G. Cosmo - a New Approach to Dielectric Screening in Solvents with Explicit Expressions for the Screening Energy and Its Gradient. J. Chem. Soc. Perk. Trans. 2 1993, 799-805. 144. Wick, C.; Hennemann, M.; Stewart, J. P.; Clark, T. Self-Consistent Field Convergence for Proteins: A Comparison of Full and Localized-Molecular-Orbital Schemes. J. Mol. Model. 2014, 20, 2159. 145. Karplus, P. A.; Diederichs, K. Linking Crystallographic Model and Data Quality. Science 2012, 336, 1030-1033. 146. Murshudov, G. N.; Vagin, A. A.; Dodson, E. J. Refinement of Macromolecular Structures by the Maximum-Likelihood Method. Acta Crystallogr. Sect. D. Biol. Crystallogr. 1997, 53, 240-255. 147. Murshudov, G. N.; Skubak, P.; Lebedev, A. A.; Pannu, N. S.; Steiner, R. A.; Nicholls, R. A.; Winn, M. D.; Long, F.; Vagin, A. A. Refmac5 for the Refinement of Macromolecular Crystal Structures. Acta Crystallogr. Sect. D. Biol. Crystallogr. 2011, 67, 355-367. 148. Weigend, F. A Fully Direct RI-HF Algorithm: Implementation, Optimised Auxiliary Basis Sets, Demonstration of Accuracy and Efficiency. Phys. Chem. Chem. Phys. 2002, 4, 4285- 4291. 149. Neese, F.; Wennmohs, F.; Hansen, A.; Becker, U. Efficient, Approximate and Parallel Hartree–Fock and Hybrid DFT Calculations. A ‘Chain-of-Spheres’ Algorithm for the Hartree– Fock Exchange. Chem. Phys. 2009, 356, 98-109. 150. Neese, F. The Orca Program System. Wiley Interdiscip. Rev.: Comput. Mol. Sci. 2012, 2, 73-78.

47 For Table of Contents use only “Recommending Hartree-Fock Theory with London-Dispersion and Basis-Set- Superposition Corrections for the Optimization or Quantum Refinement of Protein Structures” Lars Goerigk*, Charles A. Collyer, and Jeffrey R. Reimers*

48

Minerva Access is the Institutional Repository of The University of Melbourne

Author/s: Goerigk, L; Collyer, CA; Reimers, JR

Title: Recommending Hartree-Fock Theory with London-Dispersion and Basis-Set-Superposition Corrections for the Optimization or Quantum Refinement of Protein Structures

Date: 2014-12-18

Citation: Goerigk, L., Collyer, C. A. & Reimers, J. R. (2014). Recommending Hartree-Fock Theory with London-Dispersion and Basis-Set-Superposition Corrections for the Optimization or Quantum Refinement of Protein Structures. JOURNAL OF PHYSICAL CHEMISTRY B, 118 (50), pp.14612-14626. https://doi.org/10.1021/jp510148h.

Persistent Link: http://hdl.handle.net/11343/57002

File Description: Accepted version