Workshop in Computational Structural Biology (81813), Spring 2019

Exercise 6:

Homology modeling, sometimes referred to as comparative modeling, is a commonly used alternative to ab-initio modeling (which requires no prior knowledge about the native structure, but also ignores much of what we know about structural diversity). Homology modeling leverages the massive amount of structural knowledge available today to guide our search for the native structure. In practice, it assumes that sequence similarity implies structural similarity, i.e. that the structure of a homologue of our query protein cannot be very different from that of the query. Therefore, we can roughly "thread" the query sequence onto the template (homologue) structure, and exploit existing knowledge to guide our modeling.

Contents

Homology modeling with Rosetta
    Overview
    Template Selection
    Loop Modeling
    Running the full simulation
Bonus section! Homology modeling with automated/pre-calculated servers
    MODBASE
    I-Tasser

Homology modeling with Rosetta

Overview

In this section, we will model the structure of the Erbin PDZ domain, using homologue template structures, and analyze the quality of the models. The structure of this protein has been solved (PDB ID 1MFG), so we will be able to compare prediction with experiment. As taught in class, classical homology modeling in Rosetta is composed of four phases:

1. Template selection
2. Template-query alignment
3. Rebuilding loop regions
4. Full-atom refinement of the model

Our model protein is the Erbin PDZ domain: Erbin contains a class-I PDZ domain that binds to the C-terminal region of the receptor tyrosine kinase ErbB2. PDB ID 1MFG is the crystal structure of the human Erbin PDZ domain bound to the peptide EYLGLDVPV, which corresponds to the C-terminal residues of human ErbB2. ErbB2 is implicated in both signaling and oncogenicity.

Template Selection

The three-dimensional organization of the template determines to a great extent the quality of the resulting model. That is why template selection is such a critical step in homology modeling, and why homology detection methods are so important in this context. Usually, successful modeling (i.e. generation of models within 3-5 Å RMSD of the experimentally solved structure) requires at least 20-30% sequence identity and/or 30-50% sequence similarity to the template structure, given a reasonable alignment with a limited number of gaps.
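The identity threshold mentioned above can be checked directly from a pairwise alignment. The following is a minimal Python sketch, assuming the two sequences are given as equal-length aligned strings with '-' marking gaps. The function name and the toy sequences are illustrative assumptions, not the real 1MFG/2Q3G alignment. Identity is computed here over aligned (non-gap) columns; other conventions, such as dividing by the query length, give different numbers.

```python
def alignment_stats(query_aln: str, template_aln: str) -> dict:
    """Count aligned, identical and gap columns of a pairwise alignment."""
    if len(query_aln) != len(template_aln):
        raise ValueError("aligned sequences must have the same length")

    aligned = identical = gap_columns = 0
    for q, t in zip(query_aln.upper(), template_aln.upper()):
        if q == "-" or t == "-":
            gap_columns += 1      # column with a gap in either sequence
            continue
        aligned += 1
        if q == t:
            identical += 1

    # Percent identity over aligned columns only (one of several conventions).
    percent_identity = 100.0 * identical / aligned if aligned else 0.0
    return {
        "aligned_columns": aligned,
        "identical": identical,
        "gap_columns": gap_columns,
        "percent_identity": percent_identity,
    }


# Hypothetical toy alignment, made up for illustration only.
stats = alignment_stats("MKV-LTGEA", "MRVALT-EA")
print(stats)
```

The same function can be applied to the two sequence lines copied from an HHPred alignment, as long as both lines have the same length.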


Question 1: Given two homologous structures and their sequence alignment, which structural regions are likely to be well aligned, and which poorly aligned? Explain briefly.

In a real-world scenario, we have a sequence whose structure we want to determine. Here, we will instead start by extracting the sequence from the structure that is already solved. The following commands extract chain A of 1MFG and renumber it to start from residue 1:

$> prody fetch 1mfg
$> prody select "protein and chain A" 1mfg -o tmp
$> sequentialPdbResSeq.pl -pdbfile tmp.pdb > 1mfgA.pdb

Next, extract the sequence of 1mfg chain A from the pdb file:

$> getFastaFromCoords.pl -pdbfile 1mfgA.pdb -chain A > 1mfgA.fasta
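For readers who want to see what such an extraction involves, here is a minimal Python sketch that reads the one-letter sequence of a chain from the CA atoms of PDB ATOM records and formats it as FASTA. It is an illustration under stated assumptions, not the course script: the function names, the demo helper and the three-residue demo structure are all hypothetical, and only the 20 standard residues are mapped (anything else becomes 'X').

```python
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def chain_sequence(pdb_text: str, chain_id: str) -> str:
    """One-letter sequence of a chain, taken from its CA ATOM records."""
    residues = []
    seen = set()
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM") or len(line) < 27:
            continue
        # Fixed PDB columns: atom name 13-16, chain 22, residue number 23-27.
        if line[21] != chain_id or line[12:16].strip() != "CA":
            continue
        res_key = line[22:27]          # residue number + insertion code
        if res_key in seen:            # skip alternate locations
            continue
        seen.add(res_key)
        residues.append(THREE_TO_ONE.get(line[17:20], "X"))
    return "".join(residues)


def to_fasta(header: str, sequence: str, width: int = 60) -> str:
    """Format a sequence as a FASTA record with fixed-width lines."""
    chunks = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([">" + header] + chunks) + "\n"


def demo_atom(serial, name, resname, chain, resseq):
    """Build a PDB-style ATOM line for the demo (coordinates set to zero)."""
    return (f"ATOM  {serial:5d}  {name:<3s} {resname:>3s} {chain}{resseq:4d}    "
            f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}")


# Hypothetical toy structure: chain A = Gly-Ser-Leu, chain B = Met.
demo_pdb = "\n".join([
    demo_atom(1, "N", "GLY", "A", 1),
    demo_atom(2, "CA", "GLY", "A", 1),
    demo_atom(3, "CA", "SER", "A", 2),
    demo_atom(4, "CA", "MET", "B", 1),
    demo_atom(5, "CA", "LEU", "A", 3),
])
demo_fasta = to_fasta("demo_A", chain_sequence(demo_pdb, "A"))
print(demo_fasta, end="")
```

Note that a sequence read from coordinates only contains residues that were resolved in the structure, so it can be shorter than the sequence of the crystallized construct.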

HHPred is a popular server for finding template structures and for creating homology models. Go to https://toolkit.tuebingen.mpg.de/#/tools/hhpred. In the "Input" section, press "Upload file" and select your FASTA file (or paste the sequence into the box above). Press "Submit job". While the job is running, answer the following question:

Question 2: What features (e.g., biological, structural), other than sequence alignments, do you think could be used for choosing a template structure for homology modeling? Try to think of at least two features.

The results list the best template candidates found in the searched database. Press the "Help" button (top right corner) for detailed info. We'll use the 4th structure in the list, 2Q3G:A, as the template. You can see the alignment between 2Q3G and 1MFG, including some useful indicators, such as the secondary structure prediction for 1MFG (E = sheet, H = helix, C = coil/loop), the amino-acid sequence, and the consensus sequence of the multiple-sequence alignment for 1MFG.

Question 3: What is the sequence identity between 2Q3G and 1MFG? Is 2Q3G expected to be a good template for this modeling task?

Once we have decided on the template, we are left with some technical tasks: acquiring and cleaning the template structure, and generating a simple alignment file of both sequences, to be used in the next phases. Rosetta uses this alignment to determine which parts of the target can be modeled based on the template (i.e. copied from it), and which require explicit modeling. The alignment file can be created by simply copying the sequence lines from HHPred. We provide you with this file, along with others, for the purpose of this exercise.

The last two steps of homology modeling (loop modeling and refinement) are also useful in other contexts, e.g. ab-initio modeling or interface design. In the ab-initio exercise, we practiced refinement (via the fast relax protocol), and in this exercise we will practice loop modeling as an independent module.
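Returning to the role of the alignment: the way it splits the target into copied and explicitly modeled regions can be sketched in a few lines of Python. This is an illustration of the idea only, not Rosetta's implementation. The function name and the toy alignment are assumptions made for this example. Target residues aligned to a template residue are reported as "covered", and target residues facing a gap in the template as "uncovered" (to be built by loop modeling).

```python
def split_query_by_coverage(query_aln: str, template_aln: str):
    """Return (covered, uncovered) lists of 1-based inclusive query ranges."""
    if len(query_aln) != len(template_aln):
        raise ValueError("aligned sequences must have the same length")

    covered, uncovered = [], []
    q_pos = 0
    for q, t in zip(query_aln, template_aln):
        if q == "-":
            # Template-only column: no query residue to place here. A real
            # protocol would also need to rebuild the chain around this spot.
            continue
        q_pos += 1
        ranges = covered if t != "-" else uncovered
        if ranges and ranges[-1][1] == q_pos - 1:
            ranges[-1][1] = q_pos          # extend the current range
        else:
            ranges.append([q_pos, q_pos])  # start a new range
    return [tuple(r) for r in covered], [tuple(r) for r in uncovered]


# Hypothetical toy alignment, made up for illustration only.
covered, uncovered = split_query_by_coverage("MKVLLTGEAQ", "MRV--TGE-Q")
print("copy from template:", covered)
print("model explicitly:  ", uncovered)
```

The uncovered ranges correspond to the gaps discussed in the loop modeling part of this exercise.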

Loop Modeling

Loop modeling is a common sub-problem of homology modeling and is often one of the toughest parts of the modeling, because loops are conformationally diverse and inherently flexible. They play an important role in how proteins interact with one another and with other molecules.

Question 4:

i. Loop regions are sometimes missing from crystal structures - why do you think this might happen?
ii. Explain the relation between gaps in the sequence alignment of homologous proteins and loop regions.


iii. Loops are known to be important for protein function. Explain how they contribute to function, and provide an example.

Examine our Erbin PDZ structure in PyMOL. Pay special attention to the loop at residues 10-17 - this is the loop we are about to model. (Note that in real-life scenarios we would model the loop regions after copying the template segments of the aligned regions. Here the exact "take-off" sites come from the native structure, which makes this loop modeling task slightly easier than in full homology modeling.)

Question 5: How flexible do you think this region is? Do you think it would be easy to model this loop? Do you think it's too long/short?

Create a directory for this loop modeling session, placing the structure there:

$> mkdir loopmodel
$> cp 1mfgA.pdb loopmodel
$> cd loopmodel

We will now define which loops we want to model, using a loop file with a very simple format. Make a file (called loopfile, for example) that contains the following line:

LOOP 10 17
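A loop file line like this can also be generated and sanity-checked programmatically. The sketch below is a minimal Python illustration that handles only the columns used in this exercise: the LOOP keyword, the start and end residues, and an optional fourth column giving a fixed cutpoint. The function names are hypothetical, the requirement that the cutpoint lie inside the loop is an assumption made for this sketch, and the cutpoint value 13 is an arbitrary example.

```python
def format_loop_line(start: int, end: int, cut: int = None) -> str:
    """Build one loop-file line: LOOP <start> <end> [<cutpoint>]."""
    if not 1 <= start < end:
        raise ValueError("need 1 <= start < end")
    # Assumption of this sketch: a fixed cutpoint must lie inside the loop.
    if cut is not None and not start <= cut <= end:
        raise ValueError("cutpoint must lie within the loop")
    fields = ["LOOP", str(start), str(end)]
    if cut is not None:
        fields.append(str(cut))
    return " ".join(fields)


def parse_loop_line(line: str):
    """Parse a loop-file line into (start, end, cut); cut is None if absent."""
    fields = line.split()
    if len(fields) < 3 or fields[0] != "LOOP":
        raise ValueError("not a LOOP line: " + repr(line))
    start, end = int(fields[1]), int(fields[2])
    cut = int(fields[3]) if len(fields) > 3 else None
    return start, end, cut


loop_line = format_loop_line(10, 17)                # loop used in this exercise
loop_with_cut = format_loop_line(10, 17, cut=13)    # 13 is an arbitrary example
print(loop_line)
print(loop_with_cut)
print(parse_loop_line(loop_line))
```

Writing the first printed line to a file named loopfile reproduces the file described above.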

This instructs Rosetta to model the loop from residue 10 to residue 17, using random positions within the loop as cutpoints (if a specific position is desired as the cutpoint of the loop, a fourth column stating that position can be added). There are two main algorithms used in Rosetta to model loops, namely kinematic loop closure (KIC) and cyclic coordinate descent (CCD). Here we will use KIC [1]. Run the loop modeling protocol using the loop file and input structure:

# -loops:loop_file   defines the loops to model
# -loops:extended    false = use the input loop as the starting conformation
# -loops:remodel     chooses the modeling protocol
loopmodel.linuxgccrelease \
    -database $ROSETTA_DB \
    -nstruct 15 \
    -s 1mfgA.pdb \
    -native 1mfgA.pdb \
    -out:prefix myloop \
    -loops:loop_file loopfile \
    -loops:extended false \
    -loops:remodel perturb_kic \
    -out:file:fullatom \
    > log

Note

For kinematic loop closure you don't need fragment files. If you choose to use quick_ccd instead, replace the modeling protocol option with -loops:remodel quick_ccd and add -loops:frag_files aa1mfgA09_05.200_v1_3.gz aa1mfgA03_05.200_v1_3.gz none (the fragment files can be obtained from the Robetta server).

Let's examine how well we modeled the loop: inspect the models in PyMOL.

Question 6: Is there any loop conformation that looks similar to the native? How many?
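Visual inspection can be complemented by a number. The sketch below computes the CA RMSD over the loop residues between a model and the native, using only the Python standard library. It assumes that the model and the native are already in the same coordinate frame (reasonable when only the loop was rebuilt and the rest of the chain was left in place) and therefore performs no superposition. If that assumption does not hold, superimpose the structures first. The function names and the two-residue demo coordinates are hypothetical.

```python
import math


def ca_coords(pdb_text: str, chain_id: str, first: int, last: int) -> dict:
    """Map residue number -> (x, y, z) of the CA atom, for first..last."""
    coords = {}
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM") or len(line) < 54:
            continue
        if line[21] != chain_id or line[12:16].strip() != "CA":
            continue
        resseq = int(line[22:26])
        if first <= resseq <= last and resseq not in coords:
            coords[resseq] = (float(line[30:38]),
                              float(line[38:46]),
                              float(line[46:54]))
    return coords


def loop_rmsd(model_text, native_text, chain_id, first, last) -> float:
    """CA RMSD over the loop residues, WITHOUT superposition."""
    model = ca_coords(model_text, chain_id, first, last)
    native = ca_coords(native_text, chain_id, first, last)
    shared = sorted(set(model) & set(native))
    if not shared:
        raise ValueError("no common CA atoms in the requested range")
    squared = sum((m - n) ** 2
                  for res in shared
                  for m, n in zip(model[res], native[res]))
    return math.sqrt(squared / len(shared))


def demo_atom(serial, resname, resseq, x, y, z):
    """Build a PDB-style CA ATOM line in chain A for the demo."""
    return (f"ATOM  {serial:5d}  CA  {resname:>3s} A{resseq:4d}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}")


# Hypothetical two-residue "loops"; the coordinates are made up.
native_demo = "\n".join([demo_atom(1, "GLY", 10, 0.0, 0.0, 0.0),
                         demo_atom(2, "SER", 11, 3.8, 0.0, 0.0)])
model_demo = "\n".join([demo_atom(1, "GLY", 10, 0.0, 3.0, 0.0),
                        demo_atom(2, "SER", 11, 3.8, 0.0, 4.0)])
demo_rmsd = loop_rmsd(model_demo, native_demo, "A", 10, 17)
print(f"loop CA RMSD: {demo_rmsd:.3f} A")
```

For the exercise, the two text arguments would be the contents of a decoy file and of 1mfgA.pdb, with the range 10 to 17.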


In this run, you used the -loops:extended false flag, which instructs Rosetta to use the input loop conformation as the starting structure for the loop modeling protocol. We ran the protocol twice, once with this flag set to true and once with it set to false, this time with -nstruct 5000. A PyMOL session (.pse) with the lowest-energy decoys is provided at $CSBW_HOME/resources/ex6/lowestEnergyLoops.pse. The score files for the two runs are also provided (extendedstart_score.sc and startconf_score.sc).

Question 7:
a. How do the results differ between the two runs, both with respect to similarity to the native and with respect to scores?
b. Here we use information from the crystal structure when trying to model its loop. When starting from a homologous template, when would it be recommended to use the template's starting conformation, and when would it not be advisable?

If you recall, the 1MFG PDB entry is a complex of a PDZ domain bound to a peptide. Let's add the peptide to the mix. Create a new folder, loopWithPep. Copy loopfile to the new directory and create a clean copy of the complex there:

$> prody fetch 1mfg
$> prody select "protein" 1mfg -o 1mfg

Go to the new directory and run the same command as before to model the loop (except use 1mfg.pdb instead of 1mfgA.pdb, both as the starting structure, -s, and as the native, -native). In the meantime, inspect the complex (1mfg.pdb) in PyMOL.

Question 8:
(a) Why should the peptide influence the results of the loop modeling?
(b) How many loops are close to native for this run? Why is that?

We provide loopWithPep.pse and loopWithPep_score.sc (found in $CSBW_HOME/resources/ex6) with 10 additional conformations: the lowest-energy structures created with the same protocol, only with -nstruct 5000, in case you didn't sample enough.
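Score files such as the ones provided can be ranked with a few lines of Python. The sketch below assumes the usual layout of Rosetta score files: lines starting with "SCORE:", the first of which is a header naming the columns. The column name used for ranking (total_score here) is an assumption, so check the header of your own file, since some outputs call it score. The demo rows, their values and the rms column are invented for illustration.

```python
def read_score_file(text: str) -> list:
    """Parse 'SCORE:' lines into a list of dicts keyed by the header names."""
    header = None
    rows = []
    for line in text.splitlines():
        if not line.startswith("SCORE:"):
            continue                      # e.g. a leading SEQUENCE: line
        fields = line.split()[1:]
        if header is None:
            header = fields               # first SCORE: line names the columns
            continue
        if fields == header or len(fields) != len(header):
            continue                      # repeated header or malformed row
        rows.append(dict(zip(header, fields)))
    return rows


def lowest_energy(rows: list, score_column: str = "total_score", n: int = 10):
    """Return the n rows with the lowest value in score_column."""
    scored = [row for row in rows if score_column in row]
    return sorted(scored, key=lambda row: float(row[score_column]))[:n]


# Hypothetical score file content; all names and numbers are made up.
demo_scores = """SEQUENCE:
SCORE: total_score rms description
SCORE: -150.2 1.8 decoy_0001
SCORE: -163.7 0.9 decoy_0002
SCORE: -141.0 3.4 decoy_0003
"""
demo_rows = read_score_file(demo_scores)
best_two = lowest_energy(demo_rows, n=2)
for row in best_two:
    print(row["description"], row["total_score"], row["rms"])
```

Plotting the score against the RMSD column for all rows is a common way to judge whether the lowest-energy decoys are also the most native-like.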

Note

Flags for high-resolution loop modeling: here we generated loop models without a refinement step. The following flags may be useful if you wish to do high-resolution loop modeling at home:

-out:file:fullatom

# high-resolution relaxation of the whole structure
-loops:relax fastrelax

# use additional rotamers during the modeling
-use_input_sc -ex1 -ex2

Running the full simulation

Now that we've covered the different phases of homology modeling, we're ready to run the complete protocol, modeling Erbin PDZ from its homologous structure 2Q3G. To make things more convenient, Rosetta offers a comparative modeling protocol that combines three steps (copying the template-aligned regions, loop modeling of the unaligned regions, and refinement) in one command line. Create a new directory for this section (e.g. homology_modeling), and make sure it contains the prerequisites for running the comparative modeling protocol (copy whatever you need from the resources folder, $CSBW_HOME/resources/ex6):

1. Fasta file of the sequence to be modeled (1mfgA.fasta)
2. PDB file of the template structure (2q3gA.pdb)


3. Alignment file between the two sequences (1mfgA.2q3gA.aln). We created this file according to the alignment suggested by HHPred.
4. Fragment files of the sequence to be modeled (aa1mfgA09_05.200_v1_3.gz and aa1mfgA03_05.200_v1_3.gz, needed for loop modeling with CCD)

The protocol takes a long list of options. Again, we recommend putting them in a flags file (e.g. cm_flags) for convenience and consistency:

# indicates the comparative modeling protocol
-run:protocol threading

-in:file:fasta 1mfgA.fasta
-in:file:template_pdb 2q3gA.pdb

# read inputs in fullatom
-in:file:fullatom

# alignment file
-in:file:alignment 1mfgA.2q3gA.aln
-cm:aln_format general

# what size fragments to use
-loops:frag_sizes 9 3 1

# fragment files for loop modeling with ccd
-loops:frag_files aa1mfgA09_05.200_v1_3.gz aa1mfgA03_05.200_v1_3.gz none

# loop modeling protocol
-loops:remodel quick_ccd

# start from extended loops
-loops:extended true

# build initial empty loop seq
-loops:build_initial true

# fast relaxation of entire structure
-loops:relax fastrelax

# fullatom output
-out:file:fullatom

-out:nstruct 10

So now you only need to type the following command:

$> minirosetta.linuxgccrelease \
    -database $ROSETTA_DB \
    @cm_flags \
    > cm.log

The output should consist of the decoys S_2Q3GA_0001.pdb through S_2Q3GA_0010.pdb and a score.fasc scoring file.


Question 9:

i. Is there variability in the scores? Explain why.
ii. Inspect the output structures in PyMOL. Attach to the exercise an image of the model with the lowest RMSD to the native structure (you can align all structures to 1mfgA by clicking on A → align → all to this).
iii. Do you think you got a good homology model? Why is that?

Bonus section! Homology modeling with automated/pre-calculated servers

We will now examine models from automated servers and pre-calculated repositories of homology models. We will then be able to compare these structures with our model of Erbin, and with the experimentally solved structure 1MFG.

MODBASE

Modeller (http://www.salilab.org/modeller/) takes a user-supplied template and alignment and generates a model based on spatial restraints between residues. MODBASE is a database of pre-calculated models generated by Modeller (a huge number of structures were calculated and deposited in advance). It also includes a model of 1MFG that was based on the 2H3L structure. We downloaded this model for you (modbase-ErbinHuman-basedOn_2H3L.pdb in the resources directory).

Question 10:
a. The model deposited in MODBASE is almost identical to the structure of 1MFG. Why do you think this protocol was more successful in predicting this structure? (Hint: Use the PDB website and the HHSearch results to obtain more information on 2H3L.)
b. Show chain B of 1MFG.pdb, the bound peptide. Did the model succeed in predicting the binding pocket of the peptide?

I-Tasser

I-Tasser is a particularly popular homology modeling server, and is considered quite effective as well. It finds templates by multiple threading alignments. Then, it uses PDB fragments to complete the model, and clusters the low-energy models. The results are then re-evaluated using TM-align. Inspect the models from I-Tasser (resource Itasser_models.pse, which also contains the templates I-Tasser used in the modeling process).

Question 11:
a. What are the differences between the models?
b. Where do the models differ from the native structure? Was the peptide affected?
c. One of the templates I-Tasser used is 3ch8. This structure was detected by HHPred, but ranked relatively low. When comparing it to 1MFG, can you explain why this is?
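TM-align reports structural similarity as a TM-score. As a rough illustration of what that number measures, the Python sketch below evaluates the standard TM-score formula for a given list of distances between aligned residue pairs. It does not search for the optimal superposition or alignment, which is what TM-align itself does, so it is only a sketch of the scoring step. The function names are hypothetical, the demo length of 23 residues is arbitrary, and the sketch refuses chains shorter than 19 residues, where the plain d0 formula stops being positive.

```python
def tm_d0(target_length: int) -> float:
    """Length-dependent distance scale d0 of the TM-score."""
    if target_length < 19:
        # Assumption of this sketch: short chains are simply not handled.
        raise ValueError("this sketch only handles chains of 19+ residues")
    return 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8


def tm_score(distances, target_length: int) -> float:
    """TM-score for given aligned-pair distances (in Angstrom).

    distances holds one value per aligned residue pair; target residues
    without an aligned partner contribute nothing to the sum.
    """
    distances = list(distances)
    if len(distances) > target_length:
        raise ValueError("more aligned pairs than target residues")
    d0 = tm_d0(target_length)
    total = sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances)
    return total / target_length


# Hypothetical 23-residue examples; all distances are made up.
perfect = tm_score([0.0] * 23, 23)    # every residue superimposed exactly
partial = tm_score([0.0] * 10, 23)    # only 10 residues aligned, all exact
shifted = tm_score([2.0] * 23, 23)    # every residue 2 A away
print(perfect, partial, shifted)
```

Because the sum is divided by the target length and d0 grows with it, the score is less sensitive to chain length than RMSD, which is one reason it is often used to compare models of different proteins.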

[1] Mandell DJ, Coutsias EA, Kortemme T. (2009). Sub-angstrom accuracy in protein loop reconstruction by robotics-inspired conformational sampling. Nature Methods 6(8):551-552.
