

Generating, Evaluating and Refining Protein Homology Models

Troy Wymore and Kristen Messinger
Biomedical Initiatives Group
Pittsburgh Supercomputing Center

Homology Modeling of

• Need for comparative modeling of proteins
• Computational techniques for generating, evaluating and refining structures
• Example of interfacing computational chemistry with sequence based

The Need for Predictive Methods

• The sequences of more than 1.2 million proteins (March 2003) have been provided by the various sequencing projects, and the gap between sequence determination and structure determination continues to grow.
• This gap increases the need for predictive methods.
• Structure determination by x-ray crystallography or NMR is still relatively difficult.

Structural

• An estimated 1/3 of all sequences are recognizably related to at least one known structure
• Known protein sequences: > 1.2 million
• Currently known structures: about 20,000
• Homology modeling could therefore provide ~400,000 structures (1/3 of 1.2 million)

What Is Homology Modeling?

• Predicts the three-dimensional structure of a given protein sequence (TARGET) based on an alignment to one or more known protein structures (TEMPLATES).
• If similarity between the TARGET sequence and the TEMPLATE sequence is detected, structural similarity can be assumed.
• In general, at least about 30% sequence identity is required for generating useful models.

Applications of Homology Modeling
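
The sequence identity threshold mentioned above can be checked directly on a target-template alignment. The following is a minimal Python sketch, not part of any modeling package: the function name `percent_identity` and the toy sequences are hypothetical, and it assumes identity is counted over aligned (non-gap) columns only. Other conventions, such as dividing by the length of the shorter sequence, give different numbers.

```python
def percent_identity(target: str, template: str) -> float:
    """Percent identity over columns where neither sequence has a gap.

    Both strings must come from the same alignment, so they have equal
    length and use '-' for gaps. (Denominator choice is an assumption.)
    """
    if len(target) != len(template):
        raise ValueError("aligned sequences must have equal length")
    aligned = 0
    identical = 0
    for a, b in zip(target.upper(), template.upper()):
        if a == "-" or b == "-":
            continue  # skip gap columns
        aligned += 1
        if a == b:
            identical += 1
    if aligned == 0:
        return 0.0
    return 100.0 * identical / aligned


# Toy alignment (hypothetical sequences): 7 identical out of 9 aligned columns.
identity = percent_identity("ACDEFGHIK-", "ACDQFGHLKW")
print(f"{identity:.1f}% identity; above 30% threshold: {identity >= 30.0}")
```

A value near or below 30% suggests the alignment deserves extra scrutiny before a model is built from it.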

Sequence Identity

60-100%: Comparable to medium-resolution NMR. Substrate specificity of small ligands, proteins.

30-60%: Molecular replacement in crystallography. Support site-directed mutagenesis through visualization.

Marti-Renom et al., Annu. Rev. Biophys. Biomol. Struct., 2000, 29:291-325.

Procedures for Comparative Protein Modeling

Start → Identify templates → Select templates → Align target with template → Build the model → Evaluate the model → Model ok?

  » Yes → End
  » No → repeat the earlier steps

Programs for Model Protein Construction

• MODELLER 6.0
  » guitar.rockefeller.edu/modeller/modeller.html

• SWISS-MOD Server
  » www.expasy.ch/swissmod/SWISS-MODEL.html

• SCWRL (SideChain placement With Rotamer Library)
  » www.fccc.edu/research/labs/dunbrack/scwrl/

* This list is not all-inclusive.

Locating Domains

• Sequences of more than 500 amino acids are almost certain to be divided into domains.
• Finding homologues may be easier if you can separate the sequence into domains.
• Regions of low complexity often separate domains in multidomain proteins.

Finding templates
http://bioinfo.pl/Meta

• Sequence based
• Threading
• Ab initio
• Consensus

Results Page

Red letters represent helices and blue letters represent sheets

If you scroll to the end of the sequence, an alignment in PIR format and an initial model in PDB format are available.

http://www.pdg.cnb.uam.es/servers/libellula

Factors to Consider in Selecting Templates

• Template environment
  » pH
  » Ligands present?
• Resolution of the templates
• Family of proteins
  » Phylogenetic tree construction can help find the subfamily closest to the target sequence
• Multiple templates?

Target-Template Alignment

• No current comparative modeling method can recover from an incorrect alignment.
• Use multiple sequence alignments as an initial guide.
• Consider slightly different alternative alignments in areas of uncertainty, and build multiple models.
• Sequence-structure alignment programs
  » MODELLER command: ALIGN_2D (not tested)
  » Tries to put gaps in variable regions/loops
• Note: the sequence from the database and the sequence from the actual PDB file are not always identical (S2C).

Differences in Multiple Sequence Alignments

1ad3    : LKPSEVSGHMADLLATLIPQY-M---DQNLYLVVKGGVPETTELLK--ERFDHIMYTGSTAVGKIVMAAAAK- : 200
1cw3    : MKVAEQTPLTALYVANLIKEAGF---PPGVVNIVPGFGPTAGAAIASHEDVDKVAFTGSTEIGRVIQVAAGSS : 254
1ad3_4  : LKPSEVSGHMADLLATLIPQY-M---DQNLYLVVKGGVPETTELL--KERFDHIMYTGSTAVGKIV-MAAAAK : 200
1cw3_4  : MKVAEQTPLTALYVANLIKEAGF---PPGVVNIVPGFGPTAGAAIASHEDVDKVAFTGSTEIGRVIQVAAGSS : 254
1ad3_5  : LKPSEVSGHMADLLATLIPQY-M---DQNLYLVVKGGVPETTELLKER--FDHIMYTGSTAVGKIV-MAAAAK : 200
1cw3_5  : MKVAEQTPLTALYVANLIKEAGF---PPGVVNIVPGFGPTAGAAIASHEDVDKVAFTGSTEIGRVIQVAAGSS : 254
1ad3_6  : LKPSEVSGHMADLLATLIPQY-M---DQNLYLVVKGGVPETTELLKER--FDHIMYTGSTAVGKIV-MAAAAK : 200
1cw3_6  : MKVAEQTPLTALYVANLIKEAGF---PPGVVNIVPGFGPTAGAAIASHEDVDKVAFTGSTEIGRVIQVAAGSS : 254
1ad3_ce : LKPSEVSGHMADLLATLIPQYM----DQNLYLVVKGGV-PETTELLKE-RFDHIMYTGSTAVGKIVMAAAA-K : 200
1cw3_ce : MKVAEQT---PLTALYVANLIKEAGFPPGVVNIVPGFGPTAGAAIASHEDVDKVAFTGSTEIGRVIQVAAGSS : 254

Legend: Helix, Sheet

Inserting gap at ends of helix versus in the middle

When gaps are placed at the ends of helices, all models built from these alignments have an RMSD of 1.3-1.8 Å from the actual structure. In another helical region, placing the gaps in the middle of the helix gives an RMSD of ~2.0 Å, versus less than 1.0 Å for the correct alignment.

Alignment of variable regions

1ad3    : ---EDAKQSRDYGRIINDRHFQRVKGLIDNQK------VAHGGTWDQSSRYIAPTILVDVDPQSPVMQEEIFG : 335
1cw3    : ---NPFDSKTEQGPQVDETQFKKILGYINTGKQEGAKLLCGGGIAADRGYFIQPTVFGDVQDGMTIAKEEIFG : 396
1ad3_4  : ---EDAKQSRDYGRIINDRHFQRVKGLIDNQK------VAHGGTWDQSSRYIAPTILVDVDPQSPVMQEEIFG : 335
1cw3_4  : ---NPFDSKTEQGPQVDETQFKKILGYINTGKQEGAKLLCGGGIAADRGYFIQPTVFGDVQDGMTIAKEEIFG : 396
1ad3_5  : ---EDAKQSRDYGRIINDRHFQRVKGLIDNQK------VAHGGTWDQSSRYIAPTILVDVDPQSPVMQEEIFG : 335
1cw3_5  : ---NPFDSKTEQGPQVDETQFKKILGYINTGKQEGAKLLCGGGIAADRGYFIQPTVFGDVQDGMTIAKEEIFG : 396
1ad3_6  : ---EDAKQSRDYGRIINDRHFQRVKGLI-----DNQKVAHGGTW-DQSSRYIAPTILVDVDPQSPVMQEEIFG : 335
1cw3_6  : ---NPFDSKTEQGPQVDETQFKKILGYINTGKQEGAKLLCGGGIAADRGYFIQPTVFGDVQDGMTIAKEEIFG : 396
1ad3_ce : AKQ-----SRDYGRIINDRHFQRVKGLID-----NQKVAHGGTWD-QSSRYIAPTILVDVDPQSPVMQEEIFG : 335
1cw3_ce : VVGNPFDSKTEQGPQVDETQFKKILGYINTGKQEGAKLLCGGGIAADRGYFIQPTVFGDVQDGMTIAKEEIFG : 396

Placement of gaps in variable regions does not affect the quality of the models. Residues 350-370 of 1cw3 produce the following RMSDs (Å): CE: 4.065, Clustal: 4.424, PileUp: 4.128, SAGA: 4.613, MSA: 3.920. The range of values for SAGA, Clustal and PileUp reflects differences due to MODELLER procedures. The poor quality of all of these models points to the need for other methods to model loops.

Secondary Structure Prediction

• Predator
• Predict Protein Server
  » http://www.embl-heidelberg.de/predictprotein
• SSP & NNSSP (NNSSP uses neural networks)
  » http://dot.imgen.bcm.tmc.edu:9331/pssprediction/
• predictioncenter.llnl.gov/other/interesting.html

Other experimental techniques

• Are there any other published experimental data that could aid model construction or alignment?
  » Fluorescence
  » Circular dichroism
  » Electron microscopy
  » Site-directed mutagenesis
• MODELLER command
  » ADD_RESTRAINT

Constructing Multi-domain protein models

Building a multi-domain protein using templates corresponding to the individual domains

proteinA  aaaaaaaaaaaaa------
proteinB  ------bbbbbbbbbbbbbbb
Target    aaaaaaaaaaaaabbbbbbbbbbbbbbb

Effect of sequence similarity

Structure   Sequence identity w/ 1cw3   RMSCD over domains
1bi9        67%                         < 1.0 Å
1ad3        27%                         > 2.0 Å

• We even tried using the from structure
• RMSCD (root mean square coordinate difference) over Cα atoms.

Multiple model approach
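
The RMSCD figures quoted in the table above can be reproduced with a few lines of code once the model and the reference structure are superimposed. This is a minimal Python sketch under stated assumptions: the function name `rmscd` is hypothetical, both coordinate lists contain only Cα atoms matched one-to-one, and the structures have already been superimposed (no fitting is done here).

```python
import math


def rmscd(coords_a, coords_b):
    """Root mean square coordinate difference between matched atoms.

    coords_a, coords_b: equal-length sequences of (x, y, z) tuples in Å,
    assumed to be pre-superimposed C-alpha positions.
    """
    if len(coords_a) != len(coords_b) or len(coords_a) == 0:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        # squared displacement of one atom pair
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Toy example: every atom of the model is displaced by 1 Å along z.
model = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
actual = [(0.0, 0.0, 1.0), (3.8, 0.0, 1.0), (7.6, 0.0, 1.0)]
print(rmscd(model, actual))
```

Restricting the coordinate lists to one domain at a time gives a per-domain value like those in the table.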

• Reminder: Consider the effects of different substitution matrices, different gap penalties, and different algorithms. (Vogt et al., J. Mol. Biol., 1995, 249:816-831.)

• Construct multiple models

• Use structural analysis programs to determine the best model

Jaroszewski, Pawlowski and Godzik, J. Molecular Modeling, 1998, 4:294-309.
Venclovas, Ginalski and Fidelis, PROTEINS, 1999, 3:73-80 (Suppl).

Initial model and procedures

• Calculate coordinates for atoms that have equivalent atoms in the templates, as an average over all templates.
• CHARMM internal coordinates are used for the remaining unknown coordinates.
• Generate stereochemical and homology-derived restraints.

Spatial restraints?

MODELLER minimizes the objective function, F, with respect to the Cartesian coordinates of the protein atoms:

\[ F(R) = \sum_i c_i(f_i, p_i) \]

where
\(R\) are the Cartesian coordinates of the atoms,
\(c_i\) is a restraint that depends on \(f_i\) and \(p_i\),
\(f_i\) is a geometric feature of the molecule, such as a distance, angle or dihedral value,
\(p_i\) are parameters that help describe the restraint.

• The final stage of most homology modeling procedures involves an energy minimization of the protein structure with a molecular mechanics (MM) force field.
• An MM force field has terms for bond stretching, angle bending, torsions, and nonbonded contacts (van der Waals and electrostatics):

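\[
V = \sum_{\text{bonds}} \tfrac{1}{2} k_b (b - b_0)^2
  + \sum_{\text{angles}} \tfrac{1}{2} k_\theta (\theta - \theta_0)^2
  + \sum_{\text{torsions}} k_\phi \left[ 1 + \cos(n\phi - \delta) \right]
  + \sum_{\text{LJ}} 4\epsilon_{ij} \left[ \left( \frac{\sigma_{ij}}{r} \right)^{12} - \left( \frac{\sigma_{ij}}{r} \right)^{6} \right]
  + \sum_{\text{Coulomb}} \frac{q_i q_j}{\epsilon r}
\]

SWISS-MOD Server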
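
To make the individual terms of the MM potential energy function concrete, here is a minimal Python sketch that evaluates one term of each kind. It is an illustration only, not the CHARMM, AMBER or MODELLER implementation: all function and parameter names are hypothetical, and the Coulomb term is written exactly as q_i q_j / (ε r) with no unit conversion factor, which a real force field would include.

```python
import math


def bond_energy(b, b0, k_b):
    """Harmonic bond stretch: 1/2 k_b (b - b0)^2."""
    return 0.5 * k_b * (b - b0) ** 2


def angle_energy(theta, theta0, k_theta):
    """Harmonic angle bend: 1/2 k_theta (theta - theta0)^2 (radians)."""
    return 0.5 * k_theta * (theta - theta0) ** 2


def torsion_energy(phi, k_phi, n, delta):
    """Periodic torsion: k_phi [1 + cos(n phi - delta)] (radians)."""
    return k_phi * (1.0 + math.cos(n * phi - delta))


def lennard_jones_energy(r, epsilon_ij, sigma_ij):
    """12-6 van der Waals term: 4 eps [(sigma/r)^12 - (sigma/r)^6]."""
    sr6 = (sigma_ij / r) ** 6
    return 4.0 * epsilon_ij * (sr6 * sr6 - sr6)


def coulomb_energy(r, q_i, q_j, dielectric=1.0):
    """Electrostatic term q_i q_j / (eps r); no unit conversion applied."""
    return q_i * q_j / (dielectric * r)


# The Lennard-Jones term is zero at r = sigma and reaches its minimum,
# -epsilon, at r = 2^(1/6) * sigma.
sigma, eps = 3.5, 0.1
print(lennard_jones_energy(sigma, eps, sigma))
print(lennard_jones_energy(2 ** (1 / 6) * sigma, eps, sigma))
```

The total energy V is the sum of such terms over all bonds, angles, torsions and nonbonded atom pairs in the structure.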

• BLAST search to determine templates
• Superposition of 3D templates by SIM, a program based on diagonals of sequence similarity
• Similar algorithm used for the target-template alignment

• Conserved areas built in a similar fashion to other programs
• Non-conserved loops taken from a PDB database search
• Sidechains from a rotamer library
• Model refined with MM using GROMOS

SWISS-MOD examples

• Modeling class 2 aldehyde dehydrogenase (1cw3) using class 3 ALDH as a template resulted in the following:
  » 3.385 Å RMSD versus actual over the coenzyme domain
  » 3.425 Å RMSD versus actual over the catalytic domain

• This should be compared to the CE alignment, which produced a 2.0 Å RMSD model. The CE alignment closely corresponds to the multiple sequence alignment of ALDHs.

Clustering the ensemble

• Cluster analysis based on overall fold, followed by selection of the structure closest to the centroid of the largest cluster, is likely to identify a structure more representative of the ensemble than the commonly used minimized average structure.
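
The idea of picking a representative from the largest cluster can be sketched in a few lines of Python. This is not the NMRCLUST algorithm: it assumes a precomputed symmetric matrix of pairwise RMSDs between models, uses a simple fixed-cutoff single-linkage grouping, and takes the medoid (the member with the smallest summed distance to the other members) as a stand-in for "closest to the centroid". All names and the cutoff value are hypothetical.

```python
def cluster_by_cutoff(dist, cutoff):
    """Group models into connected components of the graph dist <= cutoff."""
    unassigned = set(range(len(dist)))
    clusters = []
    while unassigned:
        seed = min(unassigned)
        unassigned.remove(seed)
        cluster, stack = [seed], [seed]
        while stack:
            i = stack.pop()
            # collect neighbours first, then remove them from the pool
            neighbours = [j for j in unassigned if dist[i][j] <= cutoff]
            for j in neighbours:
                unassigned.remove(j)
                cluster.append(j)
                stack.append(j)
        clusters.append(sorted(cluster))
    return clusters


def pick_representative(dist, cutoff):
    """Return (medoid index, members) for the largest cluster."""
    largest = max(cluster_by_cutoff(dist, cutoff), key=len)
    medoid = min(largest, key=lambda i: (sum(dist[i][j] for j in largest), i))
    return medoid, largest


# Toy pairwise RMSD matrix (Å) for five models: 0-2 are similar, 3-4 are similar.
rmsd = [
    [0.0, 1.0, 1.2, 5.0, 5.0],
    [1.0, 0.0, 0.8, 5.0, 5.0],
    [1.2, 0.8, 0.0, 5.0, 5.0],
    [5.0, 5.0, 5.0, 0.0, 1.0],
    [5.0, 5.0, 5.0, 1.0, 0.0],
]
print(pick_representative(rmsd, cutoff=1.5))
```

Unlike a minimized average structure, the selected model is an actual member of the ensemble, so its geometry is not distorted by coordinate averaging.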

NMRCLUST (http://neon.chem.le.ac.uk/nmrclust/protocol.html)

Errors in Homology Modeling

a) Packing
b) Distortions and shifts
c) No template
d) Misalignments
e) Incorrect template

Marti-Renom et al., Annu. Rev. Biophys. Biomol. Struct., 2000, 29:291-325.

PROCHECK

http://www.biochem.ucl.ac.uk/~roman/procheck/procheck.html

PROSAII

The Z-score, \(Z_p\), of an amino acid sequence is derived from the energy \(E_p\) of the sequence:

\[ Z_p = \frac{E_p - \bar{E}}{\sigma} \]

where \(\bar{E}\) is the average energy of all fragments and \(\sigma\) is the standard deviation.

The potentials are obtained by applying Boltzmann's principle to a structural databank.
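
As a worked illustration of the Z-score definition above, the following minimal Python sketch computes it from a list of fragment energies. It is not the ProsaII code: the function name `z_score` and the toy energies are hypothetical, and the use of the population standard deviation is an assumption.

```python
import statistics


def z_score(e_p, fragment_energies):
    """Z = (E_p - mean) / standard deviation of the fragment energies.

    e_p: energy of the sequence in the structure being assessed.
    fragment_energies: energies of the same sequence threaded onto
    alternative fragments (assumed to be supplied by the caller).
    """
    mean = statistics.mean(fragment_energies)
    sd = statistics.pstdev(fragment_energies)  # population SD (assumption)
    if sd == 0:
        raise ValueError("fragment energies have zero spread")
    return (e_p - mean) / sd


# Toy numbers: mean 1.0 and standard deviation 1.0, so Z = (-3 - 1) / 1.
print(z_score(-3.0, [0.0, 2.0, 0.0, 2.0]))
```

A more negative Z-score means the energy of the assessed structure lies further below the average of the alternative fragments.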

Sippl, M., PROTEINS, 1993, 17:355-362.

ProsaII

Aspartate Aminotransferase

The most accurate x-ray structure (bold dotted line) has the best z-score.

ProsaII

M4 apo-lactate dehydrogenase

3ldh (dotted line) has an incorrect amino acid sequence.

VERIFY3D

Nature, 1992, 356:83-85.
www.doe-mbi.ucla.edu/Services/Verify_3D

Strategies for Molecular Mechanics Refinement

• Restrain the region of the model protein that is more likely to be correct, and minimize or run simulated annealing only on the suspect areas.
• Protein force fields (AMBER, CHARMM) have been parameterized to reproduce condensed-phase properties. They perform best when explicit solvent molecules are used to solvate the structure. Other issues include the effect of long-range electrostatics, etc.
• Explicit solvent simulations to refine structures are expensive and perhaps prohibitive, so...

Generalized Born in MM

• Replace water molecules with a term that describes the interaction of an atom with the solvent. This can be seen as modifying the non-bonded electrostatic term.
• Not a distance-dependent dielectric (too inaccurate), not Poisson-Boltzmann (I think, still too expensive), but Generalized Born.
• Dominy, B. and C. L. Brooks, J. Phys. Chem. B, 1999, 103:3765-3773. (CHARMM)
• Onufriev, A., Bashford, D. and D. A. Case, J. Phys. Chem. B, 2000, 104:3712-3720. (AMBER)

Modeling of loops

• Loops often determine the functional specificity of a given protein framework and contribute to active and binding sites.
• The conformation of a loop is influenced by the core stem regions that span the loop and by other surrounding residues.
• Several methods exist for performing loop refinement (too numerous to mention all):
  » Database search techniques
  » Ab initio methods

Loop Database Approach

• Define the query as the residues that make up the loop plus the main-chain atoms that are anchored to secondary structure elements.
• Search through the PDB using the following criteria:
  » Geometric (does the conformation span the loop?)
  » Sequence similarity
• Applicable to loops 7 residues long or shorter

Modeling β-secretase involved in Alzheimer's Disease

The model explains the substrate specificity of this enzyme for negatively charged residues of the amyloid precursor protein.

Sauder, J., Arthur, J. W., and R. L. Dunbrack, J. Mol. Biol., 2000, 300:241-248.