What Is Homology Modeling?

Protein Modeling: Generating, Evaluating and Refining Protein Homology Models
Troy Wymore and Kristen Messinger
Biomedical Initiatives Group, Pittsburgh Supercomputing Center

Homology Modeling of Proteins
- Need for comparative modeling of proteins
- Computational techniques for generating, evaluating and refining structures
- Example of interfacing computational chemistry with sequence-based bioinformatics

The Need for Predictive Methods
- The amino acid sequences of more than 1.2 million proteins (March 2003) have been provided by the various genome projects, and the gap between sequence determination and structure determination continues to grow.
- This gap increases the need for methods that predict protein structure.
- Structure determination by X-ray crystallography or NMR is still relatively difficult.

Structural Genomics
- An estimated one third of all sequences are recognizably related to at least one known structure
- Known protein sequences: > 1.2 million
- Currently known structures: 20,000
- Homology modeling could therefore provide ~400,000 structures (one third of 1.2 million)

What Is Homology Modeling?
- Predicts the three-dimensional structure of a given protein sequence (TARGET) based on an alignment to one or more known protein structures (TEMPLATES)
- If similarity between the TARGET sequence and the TEMPLATE sequence is detected, structural similarity can be assumed.
- In general, at least 30% sequence identity is required to generate useful models

Applications of Homology Modeling

Sequence identity   Applications
60-100%             Comparable to medium-resolution NMR; substrate specificity; docking of small ligands and proteins
30-60%              Molecular replacement in crystallography; support for site-directed mutagenesis through visualization

Marti-Renom et al. Annu. Rev. Biophys. Biomol. Struct., 2000, 29:291-325.

Procedures for Comparative Protein Modeling
[Flowchart: Start; identify templates; select templates; align target with template; build the model; evaluate the model; model ok? If yes, end; if no, repeat.]

Programs for Model Protein Construction
- MODELLER 6.0
  - guitar.rockefeller.edu/modeller/modeller.html
- SWISS-MODEL Server
  - www.expasy.ch/swissmod/SWISS-MODEL.html
- SCWRL (SideChain placement With Rotamer Library)
  - www.fccc.edu/research/labs/dunbrack/scwrl/
(This list is not all-inclusive.)

Locating Domains
- Sequences of more than 500 amino acids are almost certain to be divided into domains.
- Finding homologues may be easier if you can separate the sequence into domains.
- Regions of low complexity often separate domains in multidomain proteins

Finding templates
http://bioinfo.pl/Meta
- Sequence based
- Threading
- Ab initio
- Consensus

Results Page
Red letters represent helices and blue letters represent sheets. If you scroll to the end of the sequence, an alignment in PIR format and an initial model in PDB format are available.
http://www.pdg.cnb.uam.es/servers/libellula

Factors to Consider in Selecting Templates
- Template environment
  - pH
  - Ligands present?
- Resolution of the templates
- Family of proteins
  - Phylogenetic tree construction can help find the subfamily closest to the target sequence
- Multiple templates?

Target-Template Alignment
- No current comparative modeling method can recover from an incorrect alignment
- Use multiple sequence alignments as an initial guide.
- Consider slightly different alignments in areas of uncertainty and build multiple models
- Sequence-structure alignment programs
  - MODELLER command ALIGN_2D (not tested)
  - Tries to put gaps in variable regions/loops
- Note: the sequence from a sequence database and the sequence in the actual PDB entry are not always identical (S2C)

Differences in Multiple Sequence Alignments

0 * 240 * 260 * 280 *
1ad3    : LKPSEVSGHMADLLATLIPQY-M---DQNLYLVVKGGVPETTELLK--ERFDHIMYTGSTAVGKIVMAAAAK- : 200
1cw3    : MKVAEQTPLTALYVANLIKEAGF---PPGVVNIVPGFGPTAGAAIASHEDVDKVAFTGSTEIGRVIQVAAGSS : 254
1ad3_4  : LKPSEVSGHMADLLATLIPQY-M---DQNLYLVVKGGVPETTELL--KERFDHIMYTGSTAVGKIV-MAAAAK : 200
1cw3_4  : MKVAEQTPLTALYVANLIKEAGF---PPGVVNIVPGFGPTAGAAIASHEDVDKVAFTGSTEIGRVIQVAAGSS : 254
1ad3_5  : LKPSEVSGHMADLLATLIPQY-M---DQNLYLVVKGGVPETTELLKER--FDHIMYTGSTAVGKIV-MAAAAK : 200
1cw3_5  : MKVAEQTPLTALYVANLIKEAGF---PPGVVNIVPGFGPTAGAAIASHEDVDKVAFTGSTEIGRVIQVAAGSS : 254
1ad3_6  : LKPSEVSGHMADLLATLIPQY-M---DQNLYLVVKGGVPETTELLKER--FDHIMYTGSTAVGKIV-MAAAAK : 200
1cw3_6  : MKVAEQTPLTALYVANLIKEAGF---PPGVVNIVPGFGPTAGAAIASHEDVDKVAFTGSTEIGRVIQVAAGSS : 254
1ad3_ce : LKPSEVSGHMADLLATLIPQYM----DQNLYLVVKGGV-PETTELLKE-RFDHIMYTGSTAVGKIVMAAAA-K : 200
1cw3_ce : MKVAEQT---PLTALYVANLIKEAGFPPGVVNIVPGFGPTAGAAIASHEDVDKVAFTGSTEIGRVIQVAAGSS : 254
consensus : 6K E 3 a a 6i 6 6V G p 6 D 6 5TGST 6G466 AA
Legend: Helix, Sheet, Turn

Inserting gaps at the ends of a helix versus in the middle
When gaps are placed at the ends of helices, all models built from these alignments have an RMSD versus the actual structure of 1.3-1.8 Å. In another helical region, placing the gaps in the middle of the helix gives an RMSD of ~2.0 Å, versus less than 1.0 Å for the correct alignment.
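The RMSD values quoted above compare the coordinates of a model with those of the experimentally determined structure. A minimal sketch of that calculation follows, assuming the two structures have already been superimposed and that the atoms (for example the Cα atoms of a helix) are listed in the same order in both. The function name `rmsd` and the sample coordinates are illustrative only and do not come from the slides.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long lists of
    (x, y, z) coordinates, paired by position. The result is in the
    same units as the input (e.g. Angstroms)."""
    if len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must have the same length")
    if not coords_a:
        raise ValueError("coordinate lists must not be empty")
    # Sum of squared distances between paired atoms.
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Made-up coordinates for three atoms; the "model" is displaced 2 A along y.
actual = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
model = [(0.0, 2.0, 0.0), (1.5, 2.0, 0.0), (3.0, 2.0, 0.0)]
print(rmsd(model, actual))  # 2.0
```

No superposition is performed here; in practice the model would first be fitted onto the reference structure, and the RMSD is often restricted to a region of interest such as a single helix or domain.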
Alignment of variable regions

* 380 * 400 * 420 * 4
1ad3    : ---EDAKQSRDYGRIINDRHFQRVKGLIDNQK------VAHGGTWDQSSRYIAPTILVDVDPQSPVMQEEIFG : 335
1cw3    : ---NPFDSKTEQGPQVDETQFKKILGYINTGKQEGAKLLCGGGIAADRGYFIQPTVFGDVQDGMTIAKEEIFG : 396
1ad3_4  : ---EDAKQSRDYGRIINDRHFQRVKGLIDNQK------VAHGGTWDQSSRYIAPTILVDVDPQSPVMQEEIFG : 335
1cw3_4  : ---NPFDSKTEQGPQVDETQFKKILGYINTGKQEGAKLLCGGGIAADRGYFIQPTVFGDVQDGMTIAKEEIFG : 396
1ad3_5  : ---EDAKQSRDYGRIINDRHFQRVKGLIDNQK------VAHGGTWDQSSRYIAPTILVDVDPQSPVMQEEIFG : 335
1cw3_5  : ---NPFDSKTEQGPQVDETQFKKILGYINTGKQEGAKLLCGGGIAADRGYFIQPTVFGDVQDGMTIAKEEIFG : 396
1ad3_6  : ---EDAKQSRDYGRIINDRHFQRVKGLI-----DNQKVAHGGTW-DQSSRYIAPTILVDVDPQSPVMQEEIFG : 335
1cw3_6  : ---NPFDSKTEQGPQVDETQFKKILGYINTGKQEGAKLLCGGGIAADRGYFIQPTVFGDVQDGMTIAKEEIFG : 396
1ad3_ce : AKQ-----SRDYGRIINDRHFQRVKGLID-----NQKVAHGGTWD-QSSRYIAPTILVDVDPQSPVMQEEIFG : 335
1cw3_ce : VVGNPFDSKTEQGPQVDETQFKKILGYINTGKQEGAKLLCGGGIAADRGYFIQPTVFGDVQDGMTIAKEEIFG : 396
consensus : G 61 F 46 G I k Gg 5I PT6 DV 6 EEIFG

Placement of gaps in variable regions does not affect the quality of the models. Residues 350-370 of 1cw3 produce the following RMSDs (Å):
- CE: 4.065
- clustal: 4.424
- pileup: 4.128
- SAGA: 4.613
- MSA: 3.920
The range of values for SAGA, clustal and pileup represents differences due to MODELLER procedures. The poor quality in every case points to the need for other methods to model loops.

Secondary Structure Prediction
- Predator
- Predict Protein Server
  - http://www.embl-heidelberg.de/predictprotein
- SSP & NNSSP (NNSSP uses neural networks)
  - http://dot.imgen.bcm.tmc.edu:9331/pssprediction/
- predictioncenter.llnl.gov/other/interesting.html

Other experimental techniques
- Are there any other published experimental data that could aid in model construction or alignment?
  - Fluorescence
  - Circular dichroism
  - Electron microscopy
  - Site-directed mutagenesis
- MODELLER command
  - ADD_RESTRAINT

Constructing Multi-domain protein models
Building a multi-domain protein using templates corresponding to the individual domains:

proteinA  aaaaaaaaaaaaa---------------
proteinB  -------------bbbbbbbbbbbbbbb
Target    aaaaaaaaaaaaabbbbbbbbbbbbbbb

Effect of sequence similarity

Template   Sequence identity w/ 1cw3   RMSCD over domains
1bi9       67%                         < 1.0 Å
1ad3       27%                         > 2.0 Å

- We even tried using the structure-based sequence alignment
- RMSCD (root mean square coordinate difference) is calculated over Cα atoms.

Multiple model approach
- Reminder: consider the effects of different substitution matrices, different gap penalties, and different algorithms. (Vogt et al. J. Mol. Biol. 1995, 249:816-831.)
- Construct multiple models
- Use structural analysis programs to determine the best model
Jaroszewski, Pawlowski and Godzik. J. Molecular Modeling, 1998, 4:294-309.
Venclovas, Ginalski and Fidelis. PROTEINS, 1999, 3:73-80 (Suppl).

Initial model and procedures
- Calculate coordinates for atoms that have equivalent atoms in the templates as an average over all templates
- CHARMM internal coordinates are used for the remaining unknown coordinates
- Generate stereochemical and homology-derived restraints

Spatial restraints?
MODELLER minimizes the objective function, $F$, with respect to the Cartesian coordinates of the protein atoms:

$$F(R) = \sum_i c_i(f_i, p_i)$$

- $R$ are the Cartesian coordinates of the atoms
- $c_i$ is a restraint that depends on $f_i$ and $p_i$
- $f_i$ is a geometric feature of the molecule, such as a distance, angle or dihedral value
- $p_i$ are parameters that help describe the restraint

Molecular Mechanics
- The final stage of most homology modeling procedures involves an energy minimization of the protein structure with a molecular mechanics (MM) force field
- An MM force field has terms for bond stretching, angle bending, dihedral angle torsions, and nonbonded contacts (van der Waals and electrostatics)

$$V = \sum_{\text{bonds}} \tfrac{1}{2} k_b (b - b_0)^2 + \sum_{\text{angles}} \tfrac{1}{2} k_\theta (\theta - \theta_0)^2 + \sum_{\text{torsions}} k_\phi \left[ 1 + \cos(n\phi - \delta) \right] + \sum_{\text{LJ}} 4\epsilon_{ij} \left[ \left( \frac{\sigma_{ij}}{r_{ij}} \right)^{12} - \left( \frac{\sigma_{ij}}{r_{ij}} \right)^{6} \right] + \sum_{\text{Coulomb}} \frac{q_i q_j}{\epsilon r_{ij}}$$

SWISS-MODEL Server
- BLAST search to determine templates
- Superposition of 3D templates by the SIM structural alignment program, which is based on diagonals of sequence similarity
- A similar algorithm is used for the target-template alignment
- Conserved areas are built in a similar fashion to other programs
- Non-conserved loops are taken from a PDB database search
- Sidechains are taken from a rotamer library
- The model is refined with MM using GROMACS.

SWISS-MODEL examples
- Modeling class 2 aldehyde dehydrogenase (1cw3) using class 3 ALDH as a template resulted in the following:
  - 3.385 Å RMSD versus actual over the coenzyme domain
  - 3.425 Å RMSD versus actual over the catalytic domain
- This should be compared to the CE alignment, which produced a model with 2.0 Å RMSD. The CE alignment closely corresponds to the multiple sequence alignment of ALDHs.

Clustering the ensemble
- Cluster analysis, based on overall fold,
