Exercise 6: Homology Modeling
Workshop in Computational Structural Biology (81813), Spring 2019

Homology modeling, sometimes referred to as comparative modeling, is a commonly used alternative to ab-initio modeling (which requires zero knowledge about the native structure, but also overlooks much of our understanding of structural diversity). Homology modeling leverages the massive amount of structural knowledge available today to guide our search for the native structure. Practically, it assumes that sequence similarity implies structural similarity, i.e. that a protein homologous to our query protein cannot be very different from it. Therefore, we can roughly "thread" the query sequence onto the template (homologue) structure, and exploit existing knowledge to guide our modeling.

Contents

Homology modeling with Rosetta
    Overview
    Template Selection
    Loop Modeling
    Running the full simulation
Bonus section! Homology modeling with automated/pre-calculated servers
    MODBASE
    I-Tasser

Homology modeling with Rosetta

Overview

In this section, we will model the structure of the Erbin PDZ domain using homologue template structures, and analyze the quality of the models. The structure of this protein has been solved (PDB ID 1MFG), so we will be able to compare prediction with experiment.

As taught in class, classical homology modeling in Rosetta is composed of four phases:
1. Template selection
2. Template-query alignment
3. Rebuilding loop regions
4. Full-atom refinement of the model

Our model protein is the Erbin PDZ domain: Erbin contains a class-I PDZ domain that binds to the C-terminal region of the receptor tyrosine kinase ErbB2. PDB ID 1MFG is the crystal structure of the human Erbin PDZ domain bound to the peptide EYLGLDVPV, which corresponds to the C-terminal residues of human ErbB2. ErbB2 is implicated in both signaling and oncogenicity.
Template Selection

The three-dimensional organization of the template determines to a great extent the quality of the resulting model. That is why template selection is such a critical step in homology modeling, and why sequence alignment and homology detection methods are so important in this context. Usually, successful modeling (i.e. generation of models within 3-5 Å RMSD of the experimentally solved structure) requires at least 20-30% sequence identity, and/or 30-50% sequence similarity, to the template structure, given a reasonable alignment with a limited number of gaps.

Question 1: Given two homologous structures and their sequence alignment, what structural regions are likely to be well and poorly aligned? Explain briefly.

In a real-world scenario, we have a sequence for which we want to know the structure. Here, we will first get the sequence from the structure that is already solved. The following commands extract chain A of 1MFG and renumber it to start from residue 1:

$> prody fetch 1mfg
$> prody select "protein and chain A" 1mfg -o tmp
$> sequentialPdbResSeq.pl -pdbfile tmp.pdb > 1mfgA.pdb

Next, extract the sequence of 1mfg chain A from the PDB file:

$> getFastaFromCoords.pl -pdbfile 1mfgA.pdb -chain A > 1mfgA.fasta

HHPred is a popular server for finding template models and for creating homology models. Go to https://toolkit.tuebingen.mpg.de/#/tools/hhpred. In the "Input" section, press "Upload file" and select your FASTA file (or paste the sequence into the box above). Press "Submit job". While the job is running, answer the following question:

Question 2: What features (e.g., biological, structural), other than sequence alignments, do you think could be used for choosing a template structure for homology modeling? Try to think of at least two features.

The results list the best template candidates from the Protein Data Bank. Press the "Help" button (top right corner) for detailed info.
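The sequence-identity figures quoted under Template Selection can be checked by hand for any pairwise alignment, such as the one HHPred reports. The following is a minimal Python sketch, not part of the course scripts: the function name and the toy sequences are hypothetical, the input is assumed to be two aligned strings of equal length with "-" marking gaps, and identity is computed over columns where both sequences have a residue (other conventions, e.g. dividing by the query length, give different numbers).

```python
def percent_identity(aligned_query: str, aligned_template: str) -> float:
    """Percent identity over columns where both sequences have a residue.

    Assumes both strings come from the same pairwise alignment
    (equal length, '-' for gaps). Hypothetical helper for illustration.
    """
    if len(aligned_query) != len(aligned_template):
        raise ValueError("aligned sequences must have equal length")

    aligned_columns = 0
    matches = 0
    for q, t in zip(aligned_query.upper(), aligned_template.upper()):
        if q == "-" or t == "-":
            continue  # gap columns do not count towards identity
        aligned_columns += 1
        if q == t:
            matches += 1

    if aligned_columns == 0:
        return 0.0
    return 100.0 * matches / aligned_columns


# Toy alignment (made-up sequences, not 1MFG or 2Q3G).
toy_query = "MKV-LTGA"
toy_template = "MRVQLT-A"
print(f"{percent_identity(toy_query, toy_template):.1f}% identity")
```

To use it on the real case, paste the two aligned sequence lines from the HHPred result in place of the toy strings.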
We'll use as template the 4th structure in the list, 2Q3G:A. You can see the alignment between 2Q3G and 1MFG, including some useful indicators, such as secondary structure predictions for 1MFG (E = strand, H = helix, C = coil), the amino-acid sequence, and the consensus sequence of the multiple-sequence alignment for 1MFG.

Question 3: What is the sequence identity between 2Q3G and 1MFG? Is 2Q3G expected to be a good template for this modeling task?

Once we decide on the template we'll be using, we're left with some technical tasks: acquiring and cleaning the template structure, and generating a simple alignment file of both sequences, to be used in the next phases. Rosetta uses this alignment to determine which parts of the target can be modeled based on the template (i.e. copied from it), and which require explicit modeling. The alignment file can be created by simply copying the sequence lines from HHPred. We provide you with this file, along with others, for the purpose of this exercise.

The last two steps of homology modeling (loop modeling, refinement) are also useful in other contexts, e.g. ab-initio modeling or interface design. In the ab-initio exercise, we practiced refinement (via the fast relax protocol), and in this exercise we will practice loop modeling as an independent module.

Loop Modeling

Loop modeling is a common sub-problem of homology modeling and is often one of the toughest parts of the modeling, because loops have conformational diversity and inherent flexibility. They play an important role in how proteins interact with one another and with other molecules.

Question 4:
i. Loop regions are sometimes missing from crystal structures - why do you think this might happen?
ii. Explain the relation between gaps in the sequence alignment of homologue proteins and loop regions.
iii. Loops are known to be important for protein function. Explain how they contribute to function, and provide an example.

Examine our Erbin PDZ structure in PyMOL. Pay special attention to the loop found at residues 10-17 - this is the loop we are about to model. (Note that in real-life scenarios, we would model the loop regions after copying template segments from the aligned regions. Here we have exact "take-off" sites from the solved structure, which makes this loop modeling task slightly easier than in full homology modeling.)

Question 5: How flexible do you think this region is? Do you think it would be easy to model this loop? Do you think it's too long/short?

Create a directory for this loop modeling session, placing the structure there:

$> mkdir loopmodel
$> cp 1mfgA.pdb loopmodel
$> cd loopmodel

We will now define which loops we want to model, using a loop file with a very simple format. Make a file (called loopfile, for example) that contains the following line:

LOOP 10 17

This instructs Rosetta to model the loop between residues 10 and 17, using random positions within the loop as cutpoints (note that if a specific position is desired as the cutting point of the loop, a fourth column stating that position can be added).

There are two main algorithms used in Rosetta to model loops, namely kinematic loop closure (KIC) and cyclic coordinate descent (CCD). Here we will use KIC [1]. Run the loop modeling protocol using the loop file and input structure:

$> loopmodel.linuxgccrelease \
     -database $ROSETTA_DB \
     -nstruct 15 \
     -s 1mfgA.pdb \
     -native 1mfgA.pdb \
     -out:prefix myloop \
     -loops:loop_file loopfile \
     -loops:extended false \
     -loops:remodel perturb_kic \
     -out:file:fullatom \
     > log

In this command, -loops:loop_file defines the loops to model, -loops:extended false uses the input loop as the starting conformation, and -loops:remodel perturb_kic chooses the modeling protocol.

Note: For kinematic loop closure you don't need fragment files. If you choose to use quick_ccd, replace the modeling protocol option with -loops:remodel quick_ccd and add -loops:frag_files aa1mfgA09_05.200_v1_3.gz aa1mfgA03_05.200_v1_3.gz none (the fragment files can be obtained from the Robetta server).
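When several loops need to be defined, or the loop boundaries come from a script, the loop file described above can be generated programmatically. The sketch below is a hedged illustration in Python: it covers only the columns mentioned in the text (the LOOP keyword, start, end, and the optional cut point), the function names are hypothetical, and the check that the cut point lies inside the loop is an assumption made for safety rather than a documented Rosetta rule.

```python
from pathlib import Path
from typing import Iterable, Optional, Tuple


def loop_line(start: int, end: int, cut: Optional[int] = None) -> str:
    """Build one loop-file line: 'LOOP start end [cut]'.

    Only the columns described in the exercise text are written.
    """
    if start >= end:
        raise ValueError("loop start must be smaller than loop end")
    fields = ["LOOP", str(start), str(end)]
    if cut is not None:
        # Assumption: a user-chosen cut point should fall within the loop.
        if not (start <= cut <= end):
            raise ValueError("cut point must lie within the loop")
        fields.append(str(cut))
    return " ".join(fields)


def write_loop_file(path: str,
                    loops: Iterable[Tuple[int, int, Optional[int]]]) -> None:
    """Write one line per (start, end, cut) tuple; cut may be None."""
    lines = [loop_line(start, end, cut) for start, end, cut in loops]
    Path(path).write_text("\n".join(lines) + "\n")


# The loop used in this exercise, with a random cut point chosen by Rosetta.
print(loop_line(10, 17))
# The same loop with an explicit (illustrative) cut point at residue 13.
print(loop_line(10, 17, cut=13))
```

Calling write_loop_file("loopfile", [(10, 17, None)]) would produce the single-line file used in this section.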
Let's examine how well we modeled the loop: inspect the models in PyMOL.

Question 6: Is there any loop conformation that looks similar to the native? How many?

In this run, you have used the -loops:extended false flag, which instructs Rosetta to use the input loop conformation as the starting structure for the loop modeling protocol. We ran the protocol twice, once with this flag set to true and once with it set to false, each time with -nstruct 5000. A PyMOL session file with the lowest-energy decoys is provided under $CSBW_HOME/resources/ex6/lowestEnergyLoops.pse. The score files for the two runs are also provided (their names are extendedstart_score.sc and startconf_score.sc).

Question 7:
a. How do results differ between the two runs, both with respect to similarity to native and with respect to scores?
b. Here we use the information from the crystal structure when trying to model its loop. When starting from a homolog template, when would it be recommended to use its starting conformation and when would it not be advisable?

If you recall, the 1MFG PDB entry is a complex of a PDZ domain bound to a peptide. Let's add the peptide to the mix. Create a new folder, loopWithPep. Copy loopfile to the new directory and create there a clean copy of the complex:

$> prody fetch 1mfg
$> prody select "protein" 1mfg -o 1mfg

Go to the new directory and run the same command as before to model the loop (except use 1mfg.pdb instead of 1mfgA.pdb, both as the starting structure -s and as the native). In the meantime, inspect the complex (1mfg.pdb) in PyMOL.

Question 8:
(a) Why should the peptide influence the results of the loop modeling?
(b) How many loops are close to native for this run? Why is that?

We provide loopWithPep.pse and loopWithPep_score.sc (found in $CSBW_HOME/resources/ex6) with 10 additional conformations, the lowest-energy structures created with the same protocol but with -nstruct 5000, just in case you didn't sample enough.
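To compare runs by score, it can help to pull the lowest-energy decoys out of a score file. The following Python sketch assumes the common Rosetta score-file layout, in which data lines start with "SCORE:" and the first such line holds the column names. The column names used here (total_score, rms, description) and the toy values are assumptions for illustration only, so check the header of the provided .sc files before relying on them.

```python
from typing import Dict, Iterable, List


def parse_score_lines(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Parse 'SCORE:' lines into dictionaries keyed by the header columns."""
    header = None
    rows = []
    for line in lines:
        if not line.startswith("SCORE:"):
            continue  # skip e.g. a leading 'SEQUENCE:' line
        fields = line.split()[1:]
        if header is None:
            header = fields  # the first SCORE: line names the columns
            continue
        if fields == header or len(fields) != len(header):
            continue  # skip repeated headers and malformed lines
        rows.append(dict(zip(header, fields)))
    return rows


def lowest_energy(rows: List[Dict[str, str]], n: int = 10,
                  score_col: str = "total_score") -> List[Dict[str, str]]:
    """Return the n rows with the lowest value in score_col."""
    return sorted(rows, key=lambda row: float(row[score_col]))[:n]


# Toy score file content (made-up values and column names).
toy_lines = """SEQUENCE:
SCORE: total_score rms description
SCORE: -150.2 1.8 myloop_0001
SCORE: -162.7 0.9 myloop_0002
SCORE: -140.1 3.5 myloop_0003
""".splitlines()

for row in lowest_energy(parse_score_lines(toy_lines), n=2):
    print(row["description"], row["total_score"], row["rms"])
```

For a real file, replace toy_lines with open("startconf_score.sc").read().splitlines() and adjust score_col to whatever the header actually contains.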