A Critical Note on Symmetry Contact Artifacts and the Evaluation of the Quality of Homology Models
Total Page:16
File Type:pdf, Size:1020Kb
S S symmetry Article A Critical Note on Symmetry Contact Artifacts and the Evaluation of the Quality of Homology Models Dipali Singh 1,†, Karen R. M. Berntsen 2, Coos Baakman 2, Gert Vriend 2,* and Tapobrata Lahiri 1 1 Bioinformatics Division, Indian Institute of Information Technology, Allahabad 211012, India; [email protected] (D.S.); [email protected] (T.L.) 2 CMBI, Radboud University Nijmegen Medical Centre, 6525 GA 26-28 Nijmegen, The Netherlands; [email protected] (K.R.M.B.); [email protected] (C.B.) * Correspondence: [email protected] † Present address: Institute for Computer Science and Department of Biology, Heinrich Heine University, 40225 Düsseldorf, Germany. Received: 20 November 2017; Accepted: 2 January 2018; Published: 11 January 2018 Abstract: It is much easier to determine a protein’s sequence than to determine its three dimensional structure and consequently homology modeling will be an essential aspect of most studies that require 3D protein structure data. Homology modeling templates tend to be PDB files. About 88% of all protein structures in the PDB have been determined with X-ray crystallography, and thus are based on crystals that by necessity hold non-natural packing contacts in accordance with the crystal symmetry. Active site residues, residues involved in intermolecular interactions, residues that get post-translationally modified, or other sites of interest, normally are located at the protein surface so that it is particularly important to correctly model surface-located residues. Unfortunately, surface residues are just those that suffer most from crystal packing artifacts. Our study of the influence of crystal packing artifacts on the quality of homology models reveals that this influence is much larger than generally assumed, and that the evaluation of the quality of homology models should properly account for these artifacts. Keywords: homology modeling; symmetry contact; side chain rotamericity 1. Introduction Knowledge of the three dimensional structure of proteins is a prerequisite for rational drug design, for many forms of protein engineering, or for explaining the molecular phenotype associated with the disease phenotype caused by a mutation in the human genome. It is considerably easier to determine the sequence of a protein than it is to determine its structure, and consequently homology modeling will be an essential aspect of most studies that require protein structure data. The homology modeling community is continuously working on improving all aspects of the process, and the bi-annual CASP ‘competition’ [1–9] is a good benchmark for where the field stands. The root mean square deviation of the atomic positions in a homology model from the equivalent atoms in the corresponding real structure after they have been optimally superposed is an important aspect of the evaluation of the quality of homology modeling procedures [10–12]. Selecting correct rotamers [5–8] is another aspect. Other measures have also been proposed [13–19]. The expected frequency of occurrence of a molecule or residue in a certain conformation is exponentially related to the energy calculated for that conformation. This is a direct result of application of the Boltzmann law [20]. A frequency plot will thus have much sharper peaks than an energy plot. Similar reasoning suggests that amino acids will prefer to have their χ1 torsion angle rather close to +60◦, 180◦, and −60◦. Many articles have been written on this topic e.g., [21–26] so that the valine χ1 frequency distribution plot, shown as an example in Figure1, does not come as a surprise. Symmetry 2018, 10, 25; doi:10.3390/sym10010025 www.mdpi.com/journal/symmetry Symmetry 2018, 10, 25 2 of 13 Symmetry 2018, 10, 25 2 of 13 The relations between the backbone structure and the preferred side chain rotamer have long been The relations between the backbone structure and the preferred side chain rotamer have long knownbeen known [27], and [27], a series and a of series rotamer of rotamer libraries libraries have been have designed been designed over the over years; the years; most oftenmost often to improve to homologyimprove modelinghomology [modeling28–38]. Rotamers [28–38]. Rotamers are also ofare great also importanceof great importance when analyzing when analyzing the quality the of experimentallyquality of experimentally determined determined protein structures protein structures and homology and homology models. mode Thels. rotameric The rotameric state state of side chainsof side has chains often has been often used been as aused quality as a measurequality measure of homology of homology models models [5–8,39 [5–8,39].]. Most Most studies studies suggest thatsuggest rotamer that distributions rotamer distributions are a function are a offunction secondary of secondary structure, structure, to a lesser to extend a lesser of extend the accessibility, of the andaccessibility, barely of the and absence barely or of presence the absence of symmetry or presence contacts. of symmetry However, ifcontacts. two distributions However, areif two similar, theydistributions do not need are to similar, reflect they the samedo not underlying need to reflect data. the same underlying data. FigureFigure 1. 1.Frequency Frequency distribution of of valine valine in ina st arand strand as function as function of the of torsion the torsion angle χ angle1. Theχ red1. Theline red linerepresents represents actual actual frequencies frequencies observed observed in a subset in a of subset PDB files of PDB solved files by solvedcrystallography by crystallography at 1.8 Å resolution or better (with an R-factor of 0.18 or better) that was culled at 90% sequence identity. 26,028 at 1.8 Å resolution or better (with an R-factor of 0.18 or better) that was culled at 90% sequence identity. valines contributed to the red line. The black line represents the negated exponential of the energy as 26,028 valines contributed to the red line. The black line represents the negated exponential of the calculated with Gamess-US [40] using the 6-31g** basis set and the DFT b3lip energy model. Vertical energy as calculated with Gamess-US [40] using the 6-31g** basis set and the DFT b3lip energy model. scales are arbitrary. The small differences in the locations of the maxima of the g+ and g− peaks are Vertical scales are arbitrary. The small differences in the locations of the maxima of the g+ and g− peaks caused by the fact that the experimentally observed valines are found in-between other residues so are caused by the fact that the experimentally observed valines are found in-between other residues so that also 1-5 repulsive forces act on the Cγ atoms while the predicted distribution is calculated for an that also 1-5 repulsive forces act on the Cγ atoms while the predicted distribution is calculated for an isolated valine in vacuum. isolated valine in vacuum. Figure 2 shows, as an example, the rotamer distributions for a few amino acid types in an α- helix,Figure a β-strand,2 shows, or as a loop an example, region. In the most rotamer cases distributionsthe g− conformation for a few is preferred. amino acid The types three in plots an α for-helix, a βisoleucine-strand, orare a shown loop region.because Inthey most are representative cases the g− forconformation the whole set is of preferred. 51 plots (Gly, The Ala, three and plots Pro for isoleucinedo not have are a shown meaningful because χ1 angle). they are Phenylalanine representative in a for β-strand the whole and sethistidine of 51 plotsin an (Gly,α-helix Ala, are and shown Pro do notbecause have a their meaningful curves differχ1 angle). most from Phenylalanine all other pl inots. a βThe-strand general and absence histidine of significant in an α-helix differences are shown becausebetween their the curvessubsets differof the data most as from a function all other of eith plots.er solvent The general accessibility absence or the of presence/absence significant differences of symmetry contacts seem to suggests that this factor is not important. between the subsets of the data as a function of either solvent accessibility or the presence/absence of 88% of the protein structures in the PDB have been elucidated with X-ray crystallography, 11% symmetry contacts seem to suggests that this factor is not important. with NMR, and about 1% with other techniques. Structures solved by X-ray tend to be more accurate. 88% of the protein structures in the PDB have been elucidated with X-ray crystallography, NMR, on the other hand, often gives a better impression of any mobility present in the structure, and 11%it does with not NMR, suffer, and from about artifacts 1% withcaused other by crystal techniques. packing Structures contacts. solved by X-ray tend to be more accurate.Table NMR, 1 shows on thethat otheron average hand, about often one-fifth gives a of better the amino impression acids at of the any surface mobility of a presentprotein are in the structure,involved and in crystal it does packing. not suffer, Table from 1 artifactsalso shows caused that not by crystalall residue packing types contacts. are equally likely to be involvedTable1 in shows crystal that packing on average contacts, about but that one-fifth study ofis beyond the amino the scope acids of at this the surfacearticle. of a protein are involved in crystal packing. Table1 also shows that not all residue types are equally likely to be involved in crystal packing contacts, but that study is beyond the scope of this article. Symmetry 2018, 10, 25 3 of 13 Symmetry 2018, 10, 25 3 of 13 FigureFigure 2. χ 12. frequency χ1 frequency plots plots for for residues residues with or or without without symmetry symmetry contacts contacts (S+ or (S+ S−, orrespectively) S−, respectively) and withand variouswith various solvent solvent accessibilities.