Protein Structure Prediction
Protein structure prediction
Zinc-binding sites, one-dimensional structure and remote homology

Nanjiang Shu

Doctoral Thesis
Department of Physical, Inorganic, and Structural Chemistry (Division of Structural Chemistry)
Stockholm University
Stockholm 2010

Doctoral Dissertation 2010
Structural Chemistry
Arrhenius Laboratory
Stockholm University
S-106 91 Stockholm
Sweden

Faculty opponent:
• Professor Geoff Barton, School of Life Sciences, University of Dundee, Dundee, UK

Evaluation committee:
• Professor Elisabeth Sauer-Eriksson, Umeå University, Sweden
• Associate Professor Erik Lindahl, Stockholm University, Sweden
• Professor Hans Ågren, KTH, Sweden
• Reserve: Professor Jozef Kowalewski, Stockholm University, Sweden

© Nanjiang Shu, Stockholm 2010
ISBN: 978-91-7155-984-5
Printed in Sweden by US-AB
The cover image is modified from Figure 1 and Figure 6 in the thesis.

To my beloved family

Abstract

Predicting the three-dimensional (3D) structure of proteins is a central problem in biology. Computationally predicted 3D protein structures have been successfully applied in many fields of biomedicine, e.g. family assignments and drug discovery. The accurate detection of remotely homologous templates is critical for the successful prediction of the 3D structure of proteins. Also, the prediction of one-dimensional (1D) protein structures, such as secondary structures and shape strings, is useful for predicting the 3D structure of proteins and important for understanding the sequence-structure relationship. In addition, the prediction of the functional sites of proteins, such as metal-binding sites, can not only reveal the important functions of proteins (even in the absence of the 3D structure) but also facilitate the prediction of the 3D structure.
Here, three novel methods in the field of protein structure prediction are presented: PREDZINC, a method for predicting zinc-binding sites in proteins; Frag1D, a method for predicting the 1D structure of proteins; and FragMatch, a method for detecting remotely homologous proteins. These methods compete satisfactorily with the best methods previously published and contribute to the task of protein structure prediction.

List of publications

I. Shu, N., Zhou, T. and Hovmöller, S. (2008) Prediction of zinc-binding sites in proteins from sequence, Bioinformatics, 24, 775-782.
II. Zhou, T.*, Shu, N.* and Hovmöller, S. (2009) A novel method for accurate one-dimensional protein structure prediction based on fragment matching, Bioinformatics, btp679. [*Equal contribution]
III. Shu, N., Zhou, T. and Hovmöller, S. (2010) Protein homology detection by profile based fragment matching, in manuscript.
IV. Shu, N., Hovmöller, S. and Zhou, T. (2008) Describing and comparing protein structures using shape strings, Curr Protein Pept Sci, 9, 310-324.

Papers I, II and IV are reprinted with permission from the publishers.

Abbreviations

1D    One-dimensional
3D    Three-dimensional
CASP  Critical Assessment of protein Structure Prediction
CATH  A hierarchical classification of protein domain structures, which clusters proteins at four major levels: Class (C), Architecture (A), Topology (T) and Homologous superfamily (H)
CH    Cysteine or histidine
CHDE  Cysteine, histidine, aspartate or glutamate
CSA   Catalytic Site Atlas
DSSP  Definition of the secondary structure of proteins
HMM   Hidden Markov Model
HSSP  Homology-derived structures of proteins
NMR   Nuclear Magnetic Resonance
PDB   Protein Data Bank
PSSM  Position Specific Substitution Matrix
Q3    Overall per-residue accuracy for three-state secondary structure prediction
ROC   Receiver Operating Characteristic
S3    Overall per-residue accuracy for three-state shape string prediction
S8    Overall per-residue accuracy for eight-state shape string prediction
SCOP  Structural Classification of Proteins
SOV   Segment Overlap measure
SSE   Secondary Structure Element
SVM   Support Vector Machine

Contents

ABSTRACT
LIST OF PUBLICATIONS
ABBREVIATIONS
1 INTRODUCTION
  1.1 Background: 3D protein structure prediction
  1.2 Predicting 1D protein structures
  1.3 Detecting remote homologues
  1.4 Predicting metal-binding sites in proteins
  1.5 Describing and comparing protein structures
    1.5.1 SCOP (Structural Classification of Proteins)
    1.5.2 CATH (Class, Architecture, Topology and Homologous superfamily)
    1.5.3 Comparing protein structures
2 METHODS AND MATERIALS
  2.1 Properties of shape strings
    2.1.1 Definition of shape strings
    2.1.2 Statistics on shape strings
  2.2 Predicting zinc-binding sites in proteins
    2.2.1 Statistics and properties of zinc-binding sites in proteins
    2.2.2 Method description for PREDZINC
  2.3 Predicting the 1D structure of proteins
    2.3.1 Data description
    2.3.2 Method description for Frag1D
  2.4 Detecting remote homologues
    2.4.1 Method description for FragMatch
    2.4.2 Dataset for evaluating FragMatch
3 PERFORMANCE MEASUREMENT
  3.1 Cross-validation
  3.2 Precision and recall
  3.3 Receiver operating characteristic (ROC) curve
4 SUMMARY OF SCIENTIFIC CONTRIBUTIONS
  4.1 Prediction of zinc-binding sites in proteins (Paper I)
  4.2 Prediction of 1D protein structures (Paper II)
  4.3 Remote homology detection (Paper III)
  4.4 Describing and comparing protein structures (Paper IV)
5 CONCLUSIONS
ACKNOWLEDGEMENTS
6 APPENDICES
  6.1 Appendix 1: HSSP distance
  6.2 Appendix 2: vector encoding
    6.2.1 Encoding of single-site vectors
    6.2.2 Encoding of pair-based vectors
  6.3 Appendix 3: profiles
  6.4 Appendix 4: dataset for benchmarking with PSIPRED
REFERENCES

1 Introduction

Proteins are essential to life. In living organisms, proteins fold into specific three-dimensional (3D) structures, called native structures. The functions of proteins rely on their native structures. Determining the 3D structure of proteins has become a major task for modern biological research. Protein structures are determined experimentally by X-ray crystallography, NMR spectroscopy and cryo-electron microscopy (cryo-EM). Since the determination of the first protein structure, myoglobin, by Kendrew and his colleagues 50 years ago (Kendrew et al., 1958), the number of experimentally solved protein structures deposited in the Protein Data Bank (PDB, www.pdb.org/pdb) (Berman et al., 2000) has reached 56 951 (as of Nov. 18, 2009; there are also 4626 other biological macromolecular structures such as DNA in the PDB), and this number is still doubling about every three years. However, this exciting number can be disappointing for biologists who need 3D models of proteins in their research. As of Nov. 2009, there are ~9.7 million protein sequences deposited in the UniProtKB/TrEMBL database (The-UniProt-Consortium, 2009). The chance of a protein sequence to have a solved structure