Protein Structure Prediction
Protein structure prediction
Comparative/homology modeling

What is the reason for protein structure prediction?
1. Solving protein structures experimentally is hard (sometimes impossible)
2. Many predicted structures can be close to solved structures in accuracy
3. Protein structure can provide important clues to protein function
  – Functional sites (e.g., enzyme active sites and specificity determinants)
  – Protein-protein interaction
  – Docking studies (e.g., for drug interaction/design studies)
  – See Baker and Sali paper for other uses

Assigned Reading for this section
From Pevsner text:
• Chapter 6 Multiple Sequence Alignment 205-227, 234 (Read-by: 9/27)
• Chapter 13 Protein Structure 589-625 (Read-by: 9/29)
“Protein Structure Prediction and Structural Genomics”, David Baker and Andrej Sali, Science 2001. (Read-by: 9/29)

Sources for this lecture
Park et al, “Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods.” JMB 1998
David Baker and Andrej Sali, “Protein Structure Prediction and Structural Genomics”, Science 2001
Chothia and Lesk, “The relation between the divergence of sequence and structure in proteins”, EMBO Journal 1986
Andrej Sali and Andras Fiser: selected slides from their seminars
Bioinformatics (Baxevanis and Ouellette, previous course text)
  Chapter 8: Predictive methods using protein sequences (Ofran and Rost) 198-219
  Chapter 9: Protein structure prediction and analysis (Wishart) 224-247
  Chapter 12: Creation and analysis of protein multiple sequence alignments (Barton)

Topics Covered
• Folding pathways
• Primary, secondary, tertiary and quaternary protein structure
• Secondary (2D) structure prediction
• 3D fold prediction
  – Ab initio protein structure prediction (briefly)
  – Fold recognition (classification of an unknown protein to a fold potentially without constructing a comparative model)
  – Comparative model construction (aka homology model construction)
• Community evaluation of
protein structure prediction
  – Critical Assessment of protein Structure Prediction (CASP) http://predictioncenter.org/
  – EVA (real-time continuous evaluation of protein fold prediction methods) http://cubic.bioc.columbia.edu/eva/
  – Astral datasets
• Structural Genomics Initiative

The telescope: Protein structure prediction and comparison
Protein structure prediction and comparison provide a kind of Hubble telescope enabling distant homologies to be revealed
[Figure: VirB4 model compared with TrwB PDB structure; 15% identity between VirB4 & TrwB]

Biological background and major protein structure resources

Primary, Secondary, Tertiary and Quaternary Structure

Hierarchical descriptions of proteins (follows the folding process)
• Primary structure: the amino acid sequence
• Secondary structure: “regular local structure of linear segments of polypeptide chains” (Creighton)
  – Helix (~35% of residues): subtypes: α, π and 3₁₀
  – Beta sheet (~25% of residues)
  – Both types predicted by Linus Pauling (Corey and Pauling, 1953; α helix first described by Pauling in 1951)
  – Other less common structures:
    • Beta turns
    • 3/10 helices
    • Ω loops
  – Remaining unclassifiable regions sometimes termed “random coil” or “unstructured regions”
• Tertiary structure: “Overall topology of the folded polypeptide chain” (Creighton)
  – Mediated by hydrophobic interactions between distant parts of protein
• Quaternary structure: “Aggregation of the separate polypeptide chains of a protein” (Creighton)
Baxevanis & Ouellette (Ch.
9, p.224, Wishart)

Information required for folding is (mostly) contained in the primary sequence
• Early on, proteins were shown to fold into their native structures in isolation
• This led to the belief that structure is determined by sequence alone (Anfinsen, 1973)
• Over the last decade, a significant number of proteins have been shown to not fold properly in the test tube (e.g., requiring the assistance of chaperonins)
• Nevertheless, the native 3D structure is assumed to be in some energetic minimum
• This led to the development of ab initio folding methods
Baxevanis & Ouellette (Ch. 9, Wishart)

Folding pathways
• Evidence that local structure segments form first, and then pack against each other to form 3D fold
  – Exploited in protein fold prediction, Rosetta method
    • Simons, Bonneau, Ruczinski & Baker (1999). Ab initio Protein Structure Prediction of CASP III Targets Using ROSETTA. Proteins
• Semi-stable structural intermediates on folding pathway to lowest-energy conformation
  – Prof. Susan Marqusee, Berkeley
Baxevanis & Ouellette (Ch. 9, Wishart)

Proteins can diverge structurally and functionally from a common ancestor
SCOP Scorpion-toxin-related superfamily:
• 1AGT: Agitoxin 2, Egyptian scorpion (K+ channel inhibitor)
• 1MYN: Drosomycin, antifungal protein, fruit fly
• 1CN2: Toxin 2, Mexican scorpion (Na+ channel inhibitor)
• 1BK8: Antimicrobial Protein 1 (Ah-Amp1), common horse chestnut
• 1AYJ: Antifungal protein 1 (RS-AFP1), radish

Sequence and structural divergence are related
“The relation between the divergence of sequence and structure in proteins”, Chothia and Lesk.
EMBO Journal 1986

Structural alignment example
ID    EC       Function
1E9Y  3.5.1.5  Urease
1J79  3.5.2.3  Dihydroorotase
Identity: 9.8%
Equivalent residues: 40%

SCOP comparison of 1e9y and 1j79**
1j79 (c.1.9.4) and one domain of 1e9y (d1e9yb2: c.1.9.2) are placed in the same SCOP superfamily (c.1.9.*)

Sequence and structural divergence are correlated**
Accuracy of sequence alignment relative to structural alignment
                         | Pairwise alignment         | MSA-pw                     | Sequence-profile methods
%ID    #pair  %Superpos  | BLAST    ClustalW Tcoffee  | ClustalW MAFFT    MUSCLE   | HMM     TreeHMM TreeHMM-Opt
>70    107    90.6       | 0.954    0.955    0.955    | 0.955    0.954    0.954    | 0.951   0.954   0.96
50-70  63     87.2       | 0.862    0.903    0.894    | 0.901    0.919    0.911    | 0.903   0.904   0.929
40-50  46     83.4       | 0.824    0.872    0.855    | 0.856    0.862    0.846    | 0.855   0.855   0.934
30-40  65     85.4       | 0.811    0.874    0.867    | 0.87     0.892    0.925    | 0.899   0.892   0.953
25-30  41     82.1       | 0.779    0.782    0.788    | 0.795    0.837    0.836    | 0.868   0.866   0.91
20-25  53     77.9       | 0.612    0.599    0.627    | 0.633    0.678    0.661    | 0.727   0.728   0.813
15-20  84     73         | 0.381    0.451    0.457    | 0.49     0.496    0.554    | 0.578   0.572   0.72
10-15  151    64.4       | 0.16     0.186    0.234    | 0.302    0.35     0.351    | 0.387   0.363   0.551
5-10   204    50.4       | -0.007   -0.014   0        | -0.047   0.098    0.075    | 0.096   0.085   0.29
0-5    122    39.5       | -0.033   -0.049   -0.051   | -0.034   -0.024   -0.022   | -0.026  -0.025  0.127
The left three columns show results of structural alignment:
%ID: structure pairs have been placed into bins based on sequence identity given the structural alignment
#pair: number of pairs in each bin
%Superpos: percent of positions that are within ~3 Angstroms RMSD (between backbone C-alpha carbons)
The remaining columns (three groups) give Cline Shift (CS) scores for pairwise sequence alignments relative to the structural alignment. The best CS score possible is 1; negative scores indicate incorrect over-alignment with very few (or no) correctly aligned residue pairs.
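As a minimal sketch of how structure pairs might be assigned to the %ID bins used in the table above, the Python below computes percent identity from a gapped pairwise alignment and maps it to a bin label. The function names `percent_identity` and `identity_bin` are hypothetical, and two details are assumptions not stated in the slides: identity is taken over columns where both sequences have a residue, and a value of exactly 70 is placed in the ">70" bin.

```python
# Bin edges copied from the %ID column of the table; values at or above
# 70 fall through to ">70" (an assumption about the boundary).
BINS = [(0, 5), (5, 10), (10, 15), (15, 20), (20, 25),
        (25, 30), (30, 40), (40, 50), (50, 70)]


def percent_identity(aln_a: str, aln_b: str, gap: str = "-") -> float:
    """Percent identity over columns where both sequences have a residue.

    Assumption: gap-containing columns are excluded from the denominator.
    Other conventions (e.g., dividing by the shorter sequence length) exist.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    aligned = [(a, b) for a, b in zip(aln_a, aln_b) if a != gap and b != gap]
    if not aligned:
        return 0.0
    matches = sum(1 for a, b in aligned if a.upper() == b.upper())
    return 100.0 * matches / len(aligned)


def identity_bin(pid: float) -> str:
    """Map a percent identity to one of the table's %ID bin labels."""
    for lo, hi in BINS:
        if lo <= pid < hi:
            return f"{lo}-{hi}"
    return ">70"


if __name__ == "__main__":
    # Toy alignment: 4 columns with residues in both rows, 3 identical.
    pid = percent_identity("ACDE-", "ACDFG")
    print(pid, identity_bin(pid))
    # The urease/dihydroorotase pair (9.8% identity) lands in the 5-10 bin.
    print(identity_bin(9.8))
```

Choosing the denominator matters most at low identity, where a few columns can move a pair across a bin boundary.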
Assessing sequence alignment with respect to structural alignment**
Xia Jiang, Duncan Brown, Nandini Krishnamurthy, Kimmen Sjolander
[Figure: Alignment accuracy as a function of % ID (including homologs, full-length sequences). Y-axis: average CS score (0 to 1); x-axis: percent ID, in bins from 10-15% to 35-40%. Methods compared: CLUSTALW, MUSCLE, MAFFT, SATCHMO]

Protein “domains” can be defined in many ways (structural, evolutionary, functional)
Pfam “domains” sometimes (but not always) correspond to structural domains

Proteins are composed of modular structural domains which are found in different domain architectures produced by gene fusion and fission events
[Figure labels: Leucine-Rich Repeat (LRR); Toll-Interleukin Receptor (TIR) domain; Kinase domain]
Promiscuous domains complicate homolog detection and function prediction

How many unique folds are there?
SCOP: 1196 unique folds https://scop.berkeley.edu/statistics/ver=2.06
CATH: 1373 unique folds http://www.cathdb.info
This only counts the number of folds found in current solved structures; it does not count the folds that exist in nature (which may be hard to solve or which crystallographers haven’t yet tried to solve)!

Major protein structure resources

SCOP and CATH structure hierarchies**
• SCOP: class, fold, superfamily, family
• Classification of individual structural domains (independently folding globular building blocks)
• Placement in the same SCOP fold:
  – implies similar topology and overall “shape”;
  – may or may not have a common ancestor
• Same superfamily:
  – more restrictive; implies a common ancestor
  – Inferred based on various analyses (including evidence from PSI-BLAST, HMMs, functional similarity)
• Same family
  – Even more restrictive; generally implies a similar function

3D protein structure superposition**
• Example tools: J-FATCAT, CE, VAST.
• Used to evaluate protein 3D structure prediction
  – Compare homology models against solved structure (e.g., CASP)
• To evaluate assertions of (distant) homology
  – Can be used to rule out homology (if structurally dissimilar)
  – Structural similarity does not automatically support homology
    • see convergent evolution
• Used to organize protein structures into hierarchies
  – E.g., SCOP and CATH
• Used to evaluate sequence alignment accuracy
  – Some (not all) MSA benchmarks use pairwise structural alignments (multiple structural alignment is more complicated)
  – but some benchmarks include proteins that do not have solved structures
  – Even among homologous proteins, some regions may superpose poorly. Structural aligners can disagree on how to align these regions. Benchmarking approaches may use consensus approaches across multiple structural aligners
  – See discussion of these benchmarking