
Comparative modeling: Homology Modeling and Threading

Some slides modified from Kristen Huber, UMass & Charles Yan, Utah State University

Types of Structure Prediction

• De novo prediction
  – methods seek to build three-dimensional protein models "from scratch"
  – Example: Rosetta
• Comparative protein structure prediction
  – modeling uses previously solved structures as starting points, or templates
  – Example: protein threading

What is comparative modeling

In general, comparative modeling consists of
– Selection of one or more templates from a database.
  • BLAST (for closely related sequences).
  • PSI-BLAST (for distantly related sequences).
  • A single template rarely provides a complete model. Alternative template structures may provide some additional structural features.
– Alignment to the target sequence.
  • Requires a correct alignment of the target and template sequences. This is not trivial, especially when the similarity is not very high.
– Refinement of geometry and regions of low sequence identity.

Comparative modeling

• Homology modeling
• Threading

[Figure: flowchart. Sequence -> sequence homology to a known fold: 30-40% homology -> homology modeling; <30% -> threading. Match found? Yes -> model; No -> ab initio.]

Challenges
– Aligning the target sequence onto the template structure or structures is challenging, and typically results in very significant errors.
– Generally, a significant fraction of residues in a target will have no structural equivalent in an available template. Reliably building regions of the structure not present in a template remains a challenge.
– Side chain accuracy of these approximate models is poor.
– Refinement remains the principal bottleneck to progress.

Sequence comparison improves fast

Improved sequence comparison techniques have broadened the scope of comparative modeling. While 30% sequence similarity was considered to be the threshold for successful comparative modeling, predictions for targets with as low as 17% sequence similarity were made during the CASP4 experiment and 6% during CASP5. The importance of comparative modeling will continue to grow as the number of experimentally determined structures grows steadily and, with it, the number of sequences that can be related to a known structure.

Little progress in refining templates (until 2018)

• Comparative modeling methods hardly differ with respect to template selection and alignment.
• Little progress in refining templates. Early hopes that methods would allow refinement have not been fulfilled. Reasons for this are a matter of hot debate within the field, with three suggested inter-related explanations: inadequate sampling of alternative conformations, insufficiently accurate description of the inter-atomic forces and too short trajectories.

Homology Modeling Defined

– A homolog of a protein is related to it by divergence from a common ancestor.
– Based on the reasonable assumption that two homologous proteins will share very similar structures.
– Given the sequence of an unknown structure and the solved structure of a homologous protein, each amino acid in the solved structure is mutated computationally into the corresponding amino acid from the unknown structure.
– The accuracy of predictions by homology modeling strongly depends on the degree of sequence similarity.

Why Homology Modeling?

• Value in structure based
• Find common catalytic sites/molecular recognition sites
• Use as a guide to planning and interpreting experiments
• 70-80% chance a protein has a similar fold to the target protein based on known structures from X-ray crystallography or NMR spectroscopy
• Sometimes it’s the only option or best guess

Similarity of primary sequences matters

• If the target and the template share more than 50% sequence identity, predictions usually are of high quality and have been shown to be as accurate as low-resolution X-ray structures.
• For 30–50% sequence identity, more than 80% of the Cα atoms can be expected to be within 3.5 Å of their true positions.
• For less than 30% sequence identity, the prediction is likely to contain significant errors.
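A minimal Python sketch of how the identity thresholds above might be applied to a target-template alignment. The function names, the choice to count identity only over columns where neither sequence has a gap, and the wording of the quality bands are assumptions made for illustration; they do not come from any particular modeling package.

```python
def percent_identity(aligned_target: str, aligned_template: str) -> float:
    """Percent identity over aligned columns where neither sequence has a gap ('-')."""
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aligned_target, aligned_template)
             if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


def expected_quality(identity: float) -> str:
    """Map percent identity onto the three bands listed on the slide."""
    if identity > 50.0:
        return "high: comparable to a low-resolution X-ray structure"
    if identity >= 30.0:
        return "medium: >80% of C-alpha atoms expected within 3.5 A"
    return "low: significant errors likely"


if __name__ == "__main__":
    identity = percent_identity("ACDEFG", "ACDQFG")
    print(f"{identity:.1f}% identity -> {expected_quality(identity)}")
```

Other definitions of percent identity (for example, dividing by the length of the shorter sequence) give different numbers for the same alignment, so the band boundaries should be read as rough guides.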

Factors affecting the quality of homology modeling

The quality of the homology model is dependent on the quality of the sequence alignment and template structure.

The approach can be complicated by the presence of alignment gaps (commonly called indels) that indicate a structural region present in the target but not in the template, and by structure gaps in the template that arise from poor resolution in the experimental procedure (usually X-ray crystallography) used to solve the structure.

The quality of homology modeling

Model quality declines with decreasing sequence identity; a typical model has ~1-2 Å root mean square deviation between the matched Cα atoms at 70% sequence identity but only 2-4 Å agreement at 25% sequence identity. However, the errors are significantly higher in the loop regions, where the amino acid sequences of the target and template proteins may be completely different.

Homology Modeling Limitations

• Cannot study conformational changes
• Cannot find new catalytic/binding sites
• Large bias towards structure of template
• Models cannot be docked together

Comparative modeling

[Figure: flowchart. Sequence -> sequence homology to a known fold: 30-40% homology -> homology modeling; <30% -> threading. Match found? Yes -> model; No -> ab initio.]

Protein Threading

Protein threading, also known as fold recognition, is a computational protein structure prediction method. It is used to model proteins that have the same fold as proteins of known structure but have no homologous proteins with known structure.

Different from homology modeling

Protein threading differs from homology modeling in that it is used for proteins that have no homologous protein structures deposited in the PDB. Protein threading predicts protein structures by using statistical knowledge of the relationship between the structures deposited in the PDB and the sequence of the protein to be modeled.

Protein Threading

• The word threading implies that one drags the sequence (ACDEFG...) step by step through each location on each template.

Why Threading (I)

• While similar sequence implies similar structure, the converse is in general not true.
• In contrast, similar structures are often found for proteins for which no sequence similarity to any known structure can be detected.
• As a consequence, the repertoire of different folds is more limited than suggested by sequence diversity.

Why threading (II)

• Fold recognition methods are motivated by the notion that structure is evolutionarily more conserved than sequence.
• Fold recognition methods are one class of comparative modeling methods that aim at predicting the three-dimensional folded structure for amino acid sequences for which homology methods provide no reliable prediction.
• Since the number of sequences is much larger than the number of folds, fold recognition methods attempt to identify a model fold for a given target sequence among the known folds even if no sequence similarity can be detected.

Threading

• Threading-based methods are known to be computationally expensive.
• Globally optimal protein threading is known to be NP-hard.
• Several threading methods ignore pairwise interaction between residues. In doing so, the threading problem is simplified considerably, and the simplified problem can be solved with dynamic programming.

• In early methods of this kind, a one-dimensional string of features was recorded for known folds and compared to the target sequence.
• The recorded features comprise attributes like buried side chain area, side chain area covered by polar atoms including water, and the local secondary structure.
• In this manner, the three-dimensional structure of known proteins is converted into a one-dimensional sequence of descriptors and fold recognition is reduced to seeking the most favorable sequence alignment between the query sequence and a database of sequences.
• Recent approaches take into account pair-wise residue interaction potentials that describe a mean force derived from a database of known structures.
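A minimal sketch of the 1-D profile idea described above: each template is reduced to a string of environment classes, and the target sequence is aligned to that string by dynamic programming, ignoring pairwise residue interactions. The two environment classes (B = buried, E = exposed), the hydrophobic residue set, the +1/-1 scores and the gap penalty are toy assumptions chosen for illustration, not values from any published threading method.

```python
from typing import Dict

# Assumption: a crude hydrophobic set used only for this toy score.
HYDROPHOBIC = set("AVLIMFWC")


def env_score(residue: str, env: str) -> int:
    """+1 if a hydrophobic residue sits in a buried site or a polar one in an exposed site, else -1."""
    buried = env == "B"
    hydrophobic = residue in HYDROPHOBIC
    return 1 if buried == hydrophobic else -1


def thread_score(sequence: str, profile: str, gap: int = -2) -> int:
    """Best global alignment score of a sequence against a 1-D environment profile."""
    n, m = len(sequence), len(profile)
    prev = [j * gap for j in range(m + 1)]      # row for the empty sequence prefix
    for i in range(1, n + 1):
        curr = [i * gap] + [0] * m
        for j in range(1, m + 1):
            curr[j] = max(
                prev[j - 1] + env_score(sequence[i - 1], profile[j - 1]),  # place residue i at site j
                prev[j] + gap,                                             # residue i left unplaced
                curr[j - 1] + gap,                                         # site j left empty
            )
        prev = curr
    return prev[m]


def best_fold(sequence: str, fold_library: Dict[str, str]) -> str:
    """Return the name of the library profile with the highest threading score."""
    return max(fold_library, key=lambda name: thread_score(sequence, fold_library[name]))


if __name__ == "__main__":
    library = {"foldA": "BBEE", "foldB": "EEBB"}   # hypothetical two-entry fold library
    print(best_fold("LLKK", library))
```

Because each residue is scored independently of its neighbours, the optimal alignment is found in time proportional to the product of the two lengths. Adding pairwise interaction terms removes this independence, which is where the NP-hardness mentioned earlier comes from.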

Threading Methods

• Bowie, Lüthy and Eisenberg (1991)
• 2 approaches to recognition methods
• Derive a 1-D profile for each structure in the fold library and align the target sequence to these profiles
  – Identify amino acids based on core or external positions
  – Part of secondary structure
• Consider the full 3-D structure of the protein template
  – Modeled as a set of inter-atomic distances
  – NP-hard (if interactions of multiple residues are included)

Threading based on secondary structure

• One approach to fold recognition is based on secondary structure prediction and comparison.
• This subclass of methods is based on the observation that secondary structure similarity can exceed 80% for sequences that exhibit less than 10% sequence similarity.
• Clearly any such approach can only be as good as the underlying secondary structure prediction method.
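As a small illustration of what "secondary structure similarity" can mean in practice, the sketch below computes the percentage of positions at which two already-aligned secondary structure strings agree. It assumes a three-state alphabet (H = helix, E = strand, C = coil) and equal-length strings; the function name is hypothetical.

```python
def ss_agreement(ss_a: str, ss_b: str) -> float:
    """Percentage of positions with the same secondary structure state."""
    if len(ss_a) != len(ss_b):
        raise ValueError("secondary structure strings must have equal length")
    if not ss_a:
        return 0.0
    same = sum(1 for a, b in zip(ss_a, ss_b) if a == b)
    return 100.0 * same / len(ss_a)


if __name__ == "__main__":
    predicted = "HHHHCCEEEE"   # e.g. predicted states for the target
    template = "HHHCCCEEEE"    # e.g. observed states for a template
    print(f"{ss_agreement(predicted, template):.0f}% agreement")
```

The same per-residue count, applied to a predicted string and the experimentally observed one, is the usual way a three-state prediction accuracy figure is reported.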

Accuracy

Accuracy of secondary structure predictions:
– 60% (1990s)
– 76% (Current)

Some Threading Programs

• 3D-PSSM (ICNET). Based on sequence profiles, solvation potentials and secondary structure.
• TOPITS (PredictProtein server) (EMBL). Based on coincidence of secondary structure and accessibility.
• UCLA-DOE Structure Prediction Server (UCLA). Executes various threading programs and reports a consensus.
• 123D+. Combines substitution matrix, secondary structure prediction, and contact capacity potentials.
• SAM/HMM (UCSC). Based on Markov models of alignments of crystallized proteins.
• FAS (Burnham Institute). Based on profile-profile matching algorithms of the query sequence with sequences from clustered PDB database.
• PSIPRED-GenThreader (Brunel)
• THREADER2 (Warwick). Based on solvation potentials and contacts obtained from crystallized proteins.
• ProFIT CAME (Salzburg)

A more complete list

• http://en.wikipedia.org/wiki/Protein_structure_prediction_software

Comparative Modeling

• SWISS-MODEL http://swissmodel.expasy.org//SWISS-MODEL.html

An example of protein

1. Problem

Problem: Protein docking analysis between OPA1 (Ligand) and SIRT‐3 (Receptor)

Input:
(1) PDB format files for OPA1 and SIRT-3
(2) Residue selection: They are interested in Lys228 in OPA1 (that is, residue 228, a lysine, in the ligand OPA1).

Output: Protein docking result

Some basics of protein docking: Docking is a method that predicts the preferred orientation of one molecule relative to a second when the two are bound to each other in a stable complex. Generally, it searches over many different binding poses (orientations) of the two proteins and then scores those poses. From the calculated scores, the software makes its docking prediction.

2. Method

Several tools can perform protein docking analysis; the following are widely used.

Global docking search tools:
• 2.1 HEX (http://hex.loria.fr/hex.php)
• 2.2 ZDOCK (http://zdock.umassmed.edu/)
• 2.3 PatchDock (http://bioinfo3d.cs.tau.ac.il/PatchDock/)

Local docking search tools (use the results from global docking tools as input and provide more accurate results):
• 2.4 RosettaDock (http://rosettadock.graylab.jhu.edu/docking2/submit)

*Note: We use a combination of global and local docking tools to get the best docking predictions.

2.1 HEX

(1) Load the receptor and ligand (in .PDB format).

(2) Set up the orientation (Controls->Orientation). Specify the interface here.

(3) Set up docking (Controls->Docking). Default docking parameters are provided; if you don't know how to set them, just use the defaults. Then click to run docking.

(4) Save the results (File->Save->Both). This outputs a comprehensive PDB format file.

2.2 ZDOCK

(1) Load the receptor and ligand, and enter your email address; the results will be sent to that address.

(2) Set up contact and blocking residues.

(3) The docking results are in PDB format.

2.3 PATCHDOCK

(1) Load the receptor and ligand PDBs.

(2) Set up the binding sites (optional).

The binding site file is a text file with the following format: [residue index] [chain ID]. For example, residue 199 of chain F:
--------
199 F
--------
If there is only one chain, just use the residue index:
--------
199
--------

2.4 RosettaDock

(1) Load the combined PDB file. RosettaDock relies on the predictions from other global search tools (such as HEX and ZDOCK).

Input format description:
a. The input is a combined PDB file. When creating a single combined input file with both partners, place a TER line between the two partners and remove all other TER lines. This can be done with a text editor, Excel or a simple script.
b. RosettaDock docks 2 binding chains, so keep only 2 chains (one from the receptor, one from the ligand). These 2 chains can be chosen from the results of other software (such as HEX or ZDOCK). Rename the chain IDs to A and B.
c. Specify the docking partners. (If you renamed the chain IDs to A and B in the combined PDB file, use A_B.)

3. Docking Results Analysis

(1) Docking results format
The output from all of these tools is in PDB format.
http://deposit.rcsb.org/adit/docs/pdb_atom_format.html
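Because the output is plain PDB text, the coordinate records can be read with a few lines of Python. The sketch below assumes the standard fixed-column layout of ATOM/HETATM records (the layout the link above describes). The `Atom` type and `parse_atoms` function are illustrative names, and only a handful of fields are extracted.

```python
from typing import List, NamedTuple


class Atom(NamedTuple):
    name: str       # atom name, columns 13-16
    res_name: str   # residue name, columns 18-20
    chain: str      # chain ID, column 22
    res_seq: int    # residue number, columns 23-26
    x: float        # coordinates in Angstroms, columns 31-54
    y: float
    z: float


def parse_atoms(pdb_text: str) -> List[Atom]:
    """Extract coordinate records from PDB text, skipping every other record type."""
    atoms = []
    for line in pdb_text.splitlines():
        if line.startswith(("ATOM", "HETATM")):
            atoms.append(Atom(
                name=line[12:16].strip(),
                res_name=line[17:20].strip(),
                chain=line[21],
                res_seq=int(line[22:26]),
                x=float(line[30:38]),
                y=float(line[38:46]),
                z=float(line[46:54]),
            ))
    return atoms
```

The fields must be sliced by column rather than split on whitespace, since adjacent fields in a PDB line can run together without a separating space.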

(2) Visualize the results
The results can be visualized using Jmol.

To use Jmol, the output first needs to be processed: the output PDB must be converted into the combined PDB format. In other words, there should be 2 models in a single PDB file.

[Figure: combined PDB file with Model 1 (Receptor) and Model 12 (Ligand).]
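Both combined-file layouts mentioned in this walkthrough can be produced by a short script: the TER-separated two-partner file for RosettaDock, and the two-model file for Jmol. The sketch below is one possible way to do it, under several assumptions. It assumes each input already contains only the chain to keep and that "2 models" means standard MODEL/ENDMDL records numbered 1 and 2. All function names are hypothetical.

```python
from typing import List


def coordinate_lines(pdb_text: str) -> List[str]:
    """Keep only ATOM/HETATM records; this also drops any existing TER lines."""
    return [line for line in pdb_text.splitlines()
            if line.startswith(("ATOM", "HETATM"))]


def set_chain(line: str, chain_id: str) -> str:
    """Overwrite the chain ID, which is column 22 of a coordinate record."""
    return line[:21] + chain_id + line[22:]


def combine_with_ter(receptor_pdb: str, ligand_pdb: str) -> str:
    """Receptor as chain A, one TER line, then ligand as chain B."""
    receptor = [set_chain(line, "A") for line in coordinate_lines(receptor_pdb)]
    ligand = [set_chain(line, "B") for line in coordinate_lines(ligand_pdb)]
    return "\n".join(receptor + ["TER"] + ligand) + "\n"


def combine_as_models(receptor_pdb: str, ligand_pdb: str) -> str:
    """Wrap receptor and ligand as MODEL 1 and MODEL 2 in a single file."""
    out: List[str] = []
    for number, text in enumerate((receptor_pdb, ligand_pdb), start=1):
        out.append(f"MODEL     {number:4d}")
        out.extend(coordinate_lines(text))
        out.append("ENDMDL")
    return "\n".join(out) + "\n"
```

A typical use would read the two input files into strings, call one of the two functions, and write the returned text to a new .pdb file.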

(3) Use Jmol to analyze the PDB result file to get other information

(For example, for SIRT3-OPA1_68000 we specify the binding sites on OPA1_68000 (the ligand), obtain the predictions (PDB), and then try to find the corresponding sites on the receptor (SIRT3). Jmol can be used to analyze the PDB file to get this information.)
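One way to list the receptor residues closest to a chosen ligand residue without opening a viewer is a simple distance search over the PDB coordinates. This is only a sketch and is not the procedure used to obtain the results reported here. It assumes the receptor and ligand carry different chain IDs in the same file. The 5 Å cutoff, the function names and the chain arguments are hypothetical choices.

```python
import math
from typing import Dict, Iterator, List, Tuple

Coord = Tuple[float, float, float]


def iter_atoms(pdb_text: str) -> Iterator[Tuple[str, int, str, Coord]]:
    """Yield (chain, residue number, residue name, xyz) for each ATOM record."""
    for line in pdb_text.splitlines():
        if line.startswith("ATOM"):
            xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            yield line[21], int(line[22:26]), line[17:20].strip(), xyz


def residues_near(pdb_text: str, ligand_chain: str, ligand_res: int,
                  receptor_chain: str, cutoff: float = 5.0
                  ) -> List[Tuple[Tuple[int, str], float]]:
    """Receptor residues with any atom within `cutoff` Angstroms of the ligand residue, nearest first."""
    atoms = list(iter_atoms(pdb_text))
    target = [xyz for chain, num, _, xyz in atoms
              if chain == ligand_chain and num == ligand_res]
    closest: Dict[Tuple[int, str], float] = {}
    for chain, num, name, xyz in atoms:
        if chain != receptor_chain:
            continue
        # Distance from this receptor atom to the nearest atom of the ligand residue.
        dist = min((math.dist(xyz, t) for t in target), default=math.inf)
        if dist <= cutoff:
            key = (num, name)
            closest[key] = min(dist, closest.get(key, math.inf))
    return sorted(closest.items(), key=lambda item: item[1])
```

Calling `residues_near(text, "B", 228, "A")` on a combined file with the ligand as chain B and the receptor as chain A would return the receptor residues near ligand residue 228, each with its shortest atom-atom distance.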

Hex+Rosetta result (for LYS228): Best docking site: Around AA277 of chain F
ZDock+Rosetta (for LYS228): Best docking site: Around AA276 of chain F

The region around AA 276 of chain F is predicted by multiple docking prediction tools to be the best docking position for the LYS228 residue of interest.

Therefore, we conclude that the LYS228 residue of OPA1 docks around AA 276 of chain F of SIRT-3.