Swiss Institute of Bioinformatics
EMBnet course: Introduction to Protein Structure Bioinformatics
Homology Modeling
Lausanne, February 22, 2007
Torsten Schwede
Biozentrum - Universität Basel
Swiss Institute of Bioinformatics
Klingelbergstr 50-70, CH-4056 Basel, Switzerland
Tel: +41-61 267 15 81

How many structures do we know?
http://www.wwpdb.org/
[Figure: Growth of the Protein Data Bank (PDB), total and yearly number of structures. PDB: http://www.pdb.org]
[Figure: Number of entries in TrEMBL, SwissProt and PDB on a logarithmic scale (100 to 10,000,000), 1986 to 2006. Sources: PDB, EBI, SIB]
⇒ No experimental structure for most protein sequences
In the near future, no experimental structure will be available for most of the known protein sequences.

Can we predict protein structures from genome sequences?
MNIFEMLRID EGLRLKIYKD TEGYYTIGIG HLLTKSPSLN AAKSELDKAI GRNCNGVITK DEAEKLFNQD VDAAVRGILR NAKLKPVYDS LDAVRRCALI NMVFQMGETG VAGFTNSLRM LQQKRWDEAA VNLAKSRWYN QTPNRAKRVI TTFRTGTWDA YKNL
• Many proteins fold spontaneously to their native structure
• Protein folding is relatively fast (nsec – sec)
• Chaperones speed up folding, but do not alter the structure
⇒ The protein sequence contains all information needed to create a correctly folded protein.

Can we predict the folding process of a protein from its sequence (ab initio)?
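Molecular Dynamics
$$
V = \sum_{\text{bonds}} \frac{k_i}{2}\,(l_i - l_{i,0})^2
  + \sum_{\text{angles}} \frac{k_i}{2}\,(\theta_i - \theta_{i,0})^2
  + \sum_{\text{torsions}} \frac{V_n}{2}\,\bigl(1 + \cos(n\omega - \gamma)\bigr)
  + \sum_{i=1}^{N} \sum_{j=i+1}^{N} \left( 4\varepsilon_{ij} \left[ \left(\frac{\sigma_{ij}}{r_{ij}}\right)^{12} - \left(\frac{\sigma_{ij}}{r_{ij}}\right)^{6} \right] + \frac{q_i q_j}{4\pi\varepsilon_0 r_{ij}} \right)
$$

Ab initio protein folding simulation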
Physical time for simulation                               10^-4 seconds
Typical time-step size                                     10^-15 seconds
Number of MD time steps                                    10^11
Atoms in a typical protein and water simulation            32'000
Approximate number of interactions in force calculation    10^9
Machine instructions per force calculation                 1000
Total number of machine instructions                       10^23
Petaflop capacity computer                                 1 petaflop (10^15 floating point operations per second)

⇒ Blue Gene will need 1-3 years to simulate 100 μsec.
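
The arithmetic behind this estimate can be retraced in a few lines. This is only a sketch: it assumes that the 1000 machine instructions apply to each of the 10^9 interactions (the reading under which the slide's total of 10^23 comes out) and that the machine sustains a full petaflop the whole time. All variable names are illustrative.

```python
# Back-of-the-envelope cost of an ab initio folding simulation,
# using the figures from the table above.
SECONDS_PER_YEAR = 365.25 * 24 * 3600

physical_time_fs = 10**11            # 10^-4 s (100 microseconds) in femtoseconds
time_step_fs = 1                     # 10^-15 s per MD step
interactions_per_step = 10**9        # pair interactions per force calculation
instructions_per_interaction = 10**3 # assumption: 1000 instructions per interaction
machine_ops_per_second = 10**15      # 1 petaflop, assumed fully sustained

n_steps = physical_time_fs // time_step_fs
total_instructions = n_steps * interactions_per_step * instructions_per_interaction
runtime_years = total_instructions / machine_ops_per_second / SECONDS_PER_YEAR

print(f"MD steps:           {n_steps:.0e}")
print(f"total instructions: {total_instructions:.0e}")
print(f"runtime:            {runtime_years:.1f} years")
```

Under these assumptions the runtime comes out slightly above three years, at the upper end of the range quoted on the slide.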
[ http://www.research.ibm.com/bluegene/ ]

Growth of the Protein Data Bank PDB
[Figure: "old" folds per year and new folds per year. PDB: http://www.pdb.org]

CATH - Protein Structure Classification
Class (C): derived from secondary structure content; assigned automatically.
Architecture (A): describes the gross orientation of secondary structures, independent of connectivity.
Topology (T): clusters structures according to their topological connections and numbers of secondary structures.
Homologous Superfamily (H): groups together protein domains which are thought to share a common ancestor and can therefore be described as homologous.
[ http://www.biochem.ucl.ac.uk/bsm/cath_new/ ]

Sequence similarity implies structural similarity?
[Figure: pairwise sequence identity (0 to 100%) versus number of residues aligned, divided into the region "Sequence identity implies structural similarity!" and the "Don't know region". (B. Rost, Columbia, New York)]
[Figure: percentage sequence identity/similarity (0 to 100%) versus number of residues aligned (0 to 250), with curves for identity and similarity separating the region "Sequence identity implies structural similarity" from the "Don't know region". (B. Rost, Columbia, New York)]

Fold recognition / Threading
Find a compatible fold for a given sequence ...
>Protein XY
MSTLYEKLGGTTAVDLAV DKFYERVLQDDRIKHFFA DVDMAKQRAHQKAFLTYA FGGTDKYDGRYMREAHKE LVENHGLNGEHFDAVAED LLATLKEMGVPEDLIAEV AAVAGAPAHKRDVLNQ
[Figure: sequence ≈ fold ?]
The number of protein folds that occur in nature is limited.
Fold recognition can be used to:
• Identify templates for comparative modeling
• Assign protein function
The "biological" perspective: Homologous proteins have evolved by molecular evolution from a common ancestor. If we can establish homology, we can predict aspects of structure and function of a new protein by analogy.
The "physical" perspective: The native conformation of a protein corresponds to a global free energy minimum of the protein / solvent system. To identify a compatible fold, the protein sequence is "threaded" through a library of folds, and empirical energy calculations are used to evaluate compatibility.
No single method is perfect. Consensus methods often perform better:
• MetaPP: http://cubic.bioc.columbia.edu/predictprotein/
• http://bioinfo.pl/meta/
Further reading: Adam Godzik, "Fold Recognition Methods", in: "Structural Bioinformatics", Bourne & Weissig, Eds.

Protein Structure / Fold Databases
PDB: http://www.pdb.org
EBI-MSD http://www.ebi.ac.uk/msd/
SCOP http://scop.mrc-lmb.cam.ac.uk/scop/
CATH http://www.biochem.ucl.ac.uk/bsm/cath_new/

Fold Recognition Servers
Meta server
• http://bioinfo.pl/meta/
3DPSSM / Phyre
• http://www.sbg.bio.ic.ac.uk/servers/3dpssm/
• http://www.sbg.bio.ic.ac.uk/~phyre/
GenTHREADER
• http://bioinf.cs.ucl.ac.uk/psipred/
FUGUE2
• http://www-cryst.bioc.cam.ac.uk/~fugue/prfsearch.html
SAM
• http://www.cse.ucsc.edu/research/compbio/HMM-apps/T99-query.html
FOLD
• http://fold.doe-mbi.ucla.edu/
FFAS/PDBBLAST
• http://bioinformatics.burnham-inst.org/

Evolution of protein structure families
Evolution of the globin family:
[Figure: Rmsd of backbone atoms in core (0.0 to 2.5) versus percent identical residues in core (100 to 0). Chothia & Lesk (1986)]
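
The quantity on the vertical axis of this plot, the RMSD over core backbone atoms, can be computed as sketched below. The sketch assumes that the two structures have already been superposed and that the atoms are listed in matching order; the optimal superposition itself is not shown. The function name `rmsd` is chosen for illustration.

```python
import math

def rmsd(coords_a, coords_b):
    """Root mean square deviation between two equally long lists of
    (x, y, z) positions, assumed to be superposed already."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = 0.0
    for atom_a, atom_b in zip(coords_a, coords_b):
        squared += sum((a - b) ** 2 for a, b in zip(atom_a, atom_b))
    return math.sqrt(squared / len(coords_a))

# Hypothetical two-atom example: every atom is displaced by 1 Å along z.
core_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
core_b = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd(core_a, core_b))
```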
Common core = all residues that can be superposed in 3D
For proteins with > 60% identical residues, the core contains > 90% of all residues, deviating by less than 1.0 Å.

Similar Sequence ⇒ Similar Structure
Homology modeling = Comparative protein modeling = Knowledge-based modeling
Idea: Using experimental 3D-structures of related family members (templates) to calculate a model for a new sequence (target).

Comparative Modeling
[Flowchart: Known Structures (Templates) and Target Sequence → Template Selection → Alignment Template - Target → Structure modeling → Homology Model(s); Structure Evaluation & Assessment]

Known Structures (Templates):
• Protein Data Bank PDB http://www.pdb.org
⇒ Database of templates
• Separate into single chains
• Remove bad structures (models)
• Create BLASTable database or fold library (profiles, HMMs)
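
As an illustration of the "separate into single chains" step, the following minimal sketch groups the coordinate records of a PDB-format file by chain identifier. It relies only on the fixed-column layout of ATOM and HETATM records, where the chain identifier is in column 22; the function name `split_chains` and the tiny example records are hypothetical, and a real template library would also need quality filtering.

```python
from collections import defaultdict

def split_chains(pdb_text):
    """Group ATOM/HETATM records of a PDB-format file by chain identifier."""
    chains = defaultdict(list)
    for line in pdb_text.splitlines():
        # Coordinate records carry the chain identifier in column 22 (index 21).
        if line.startswith(("ATOM", "HETATM")) and len(line) > 21:
            chains[line[21]].append(line)
    return dict(chains)

# Hypothetical two-record example, padded so the chain ID lands in column 22.
example = "\n".join([
    "ATOM      1  N   MET".ljust(21) + "A",
    "ATOM      2  N   GLY".ljust(21) + "B",
])
print(sorted(split_chains(example)))
```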

Template selection:
1. Sequence similarity / fold recognition
2. Structure quality (resolution, experimental method)
3. Experimental conditions (ligands and cofactors)

Alignment Template - Target:
• Multiple sequence alignment for pairs > 40% identity
or
• Use structural alignment of templates to guide sequence alignment of target
or
• Use separate profiles for template and targets
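
To check a template-target pair against the 40% guideline, the sequence identity can be read off an existing pairwise alignment. The sketch below uses one common convention, identical residues divided by the number of columns in which both sequences have a residue; other conventions (for example dividing by the length of the shorter sequence) give different numbers. The function name and the example alignment are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identical residues over the columns where neither row has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    aligned = 0
    identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip insertions and deletions
        aligned += 1
        if a == b:
            identical += 1
    if aligned == 0:
        return 0.0
    return 100.0 * identical / aligned

# Hypothetical alignment: 5 aligned columns, 4 of them identical.
print(percent_identity("MKT-AY", "MRTLAY"))
```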

Structure Evaluation & Assessment:
• Errors in template selection or alignment result in bad models
⇒ iterative cycles of alignment, modeling and evaluation
⇒ Build many models, choose best.

Structure modeling:
I. Manual model building
II. Template based fragment assembly
– Composer (Sybyl, Tripos)
– SWISS-MODEL
III. Satisfaction of spatial restraints
– Modeller (Insight II, MSI)
– CPH-Models

I. Manual Modeling
[ http://www.expasy.org/spdbv/ ]

II. Template based fragment assembly
Find structurally conserved core regions
Build model core
• … by averaging core template backbone atoms (weighted by local sequence similarity with the target sequence). Leave non-conserved regions (loops) for later ….
Loop (insertion) modeling
• Use the "spare part" algorithm to find compatible fragments in a loop database, or "ab initio" rebuilding (e.g. Monte Carlo, MD, GA, etc.) to build missing loops.
Side chain placement
• Find the most probable side chain conformation, using
– homologous structure information
– backbone-dependent rotamer libraries
– energetic and packing criteria
Rotamer Libraries
• Only a small fraction of all possible side chain conformations is observed in experimental structures
• Rotamer libraries provide an ensemble of likely conformations
• The propensity of rotamers depends on the backbone geometry
Energy minimization
• Any modeling method will produce unfavorable contacts and bonds
• Energy minimization is used to
– regularize local bond and angle geometry
– relax close contacts and geometric strain
• Extensive energy minimization will move coordinates away from the real structure ⇒ keep it to a minimum
• SWISS-MODEL uses the GROMOS 96 force field for a steepest descent minimization
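
The idea of a short steepest descent run can be shown on a toy problem. The sketch below relaxes a single strained bond length under a harmonic term; it is not the GROMOS 96 force field, and the force constant, reference length, step size and step limit are arbitrary illustrative values. Capping the number of steps reflects the advice to keep minimization to a minimum.

```python
def steepest_descent(gradient, x0, step=0.1, max_steps=20, tol=1e-6):
    """Move downhill along the negative gradient for a limited number of steps."""
    x = x0
    for _ in range(max_steps):
        g = gradient(x)
        if abs(g) < tol:
            break  # already close enough to a minimum
        x -= step * g
    return x

# Toy energy E(l) = K_BOND / 2 * (l - L0)^2 for one bond (illustrative values).
K_BOND = 5.0
L0 = 1.53

def bond_gradient(length):
    """Derivative of the harmonic bond energy with respect to the bond length."""
    return K_BOND * (length - L0)

relaxed = steepest_descent(bond_gradient, x0=1.80)
print(f"bond length after minimization: {relaxed:.4f}")
```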

III. Satisfaction of Spatial restraints
Alignment of target sequence with templates
Extraction of spatial restraints from templates
Modeling by satisfaction of spatial restraints
[Figure: target sequence M Q T S A F G T A E]
Some features of a protein structure:
R         resolution of X-ray experiment
r         amino acid residue type
Φ, Ψ      main chain angles
t         secondary structure class
M         main chain conformation class
χi, ci    side chain dihedral angle class
a         residue solvent accessibility
s         residue neighborhood difference
d         Cα-Cα distance
Δd        difference between two Cα-Cα distances
Feature properties can be associated with
• a protein (e.g. X-ray resolution)
• residues (e.g. solvent accessibility)
• pairs of residues (e.g. Cα-Cα distance)
• other features (e.g. main chain classes)
How can we derive modeling restraints from this data?
• A restraint is defined as a probability density function (pdf) p(x):
$$
\int p(x)\,dx = 1, \qquad p(x_1 \le x < x_2) = \int_{x_1}^{x_2} p(x)\,dx, \qquad \text{with } p(x) > 0
$$
Derive pdfs from frequency tables by smoothing:
a) 11 Cys residues Chi-1 angles
b) smoothed distribution from a)
c) 297 Cys Chi-1 angles as control
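
How a sparse frequency table turns into a usable pdf can be illustrated with a simple smoothing scheme. The sketch below uses periodic Gaussian kernel smoothing, which is a stand-in chosen for brevity and not necessarily the smoothing procedure behind the plots; the 11 angle values, the bin width and the kernel width are made up for illustration. The result is strictly positive and integrates to 1, as required of p(x).

```python
import math

def smoothed_pdf(angles_deg, bin_width=30.0, sigma=30.0):
    """Estimate a periodic pdf over [-180, 180) degrees from a few observed angles."""
    n_bins = int(360 / bin_width)
    centers = [-180.0 + (i + 0.5) * bin_width for i in range(n_bins)]
    density = []
    for center in centers:
        value = 0.0
        for angle in angles_deg:
            # Shortest angular difference, respecting the 360 degree periodicity.
            diff = (angle - center + 180.0) % 360.0 - 180.0
            value += math.exp(-0.5 * (diff / sigma) ** 2)
        density.append(value)
    area = sum(density) * bin_width
    return centers, [value / area for value in density]

# Hypothetical chi-1 angles of 11 residues (values invented for the example).
observed = [-60.0] * 6 + [180.0] * 3 + [60.0] * 2
centers, pdf = smoothed_pdf(observed)
for center, p in zip(centers, pdf):
    print(f"{center:7.1f}  {p:.5f}")
```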
Combine basis pdfs to molecular probability density functions
[Figure: three panels, for 0.2 < s' < 0.4 and 0.2 < s'' < 0.4; for 0.2 < s' < 0.4 and 0.4 < s'' < 0.6; and for 0.4 < s' < 0.6 and 0.2 < s'' < 0.4]
Satisfaction of spatial restraints
• Find the protein model with the highest probability
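
One way to make "highest probability" concrete is sketched below. It assumes that the individual restraints are treated as independent, so that the probability of a model is the product of the restraint pdfs evaluated at its feature values, and maximizing it is the same as minimizing the sum of the negative logarithms. The Gaussian restraint shapes, their parameters and the two candidate models are hypothetical.

```python
import math

def gaussian_pdf(mean, sigma):
    """Return a Gaussian restraint p(x) with the given mean and standard deviation."""
    def p(x):
        norm = sigma * math.sqrt(2.0 * math.pi)
        return math.exp(-0.5 * ((x - mean) / sigma) ** 2) / norm
    return p

def objective(restraints, features):
    """Negative log of the product of all restraint pdfs (lower is more probable)."""
    return -sum(math.log(p(x)) for p, x in zip(restraints, features))

# Two hypothetical distance restraints (in Å) and two candidate models,
# each described by its values for the restrained distances.
restraints = [gaussian_pdf(3.8, 0.1), gaussian_pdf(5.5, 0.3)]
model_a = [3.8, 5.6]
model_b = [4.1, 6.4]

best = min([model_a, model_b], key=lambda m: objective(restraints, m))
print("most probable model:", best)
```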
Variable target function:
• Start with a linear conformation model or a model close to the template conformation
• At first, use only local restraints
• Minimize for some steps using a conjugate gradient optimization
• Repeat, introducing more and more long-range restraints, until all restraints are used
[Figure: Optimization schedule and progress]

Model Accuracy Evaluation
CASP: Community Wide Experiment on the Critical Assessment of Techniques for Protein Structure Prediction
http://predictioncenter.org/casp7/
EVA: Evaluation of Automatic protein structure prediction
[ Burkhard Rost, Andrej Sali, http://maple.bioc.columbia.edu/eva/ ]
[Figure: EVA workflow in three numbered steps, involving a new PDB release, a target sequence (e.g. MNIFEMLRID EGLRLKIYKD TEGYYTIGIG HLLTKSPSLN AAKSELDKAI GRNCNGVITK), the prediction servers, and the evaluation of prediction accuracy]

Typical types of errors
Sequence alignment errors.
Loops which cannot be rebuilt.
Inappropriate template selection.
Subunit displacement.

Structural rearrangements …
… cause problems for template selection and automated evaluation:
• e.g. flap-region in adenylate kinases (1AKE, 4AKE)
• e.g. DNA-binding domains (1AWC, 1ETC)
… because they are sequence independent.

Protein Structure Evaluation
Problem:
How can we identify errors in 3-dimensional protein structures (without knowing the correct answer)?
Bond & Angle Geometry
Molecular Interactions
⇒ Empirical Force Fields
⇒ Statistical Methods

Empirical Force Fields
e.g. GROMOS, CHARMM, AMBER, ...
⇒ Which types of errors in a protein structure can you identify with an empirical force field?
⇒ Which types of errors are not recognized?

Statistical Methods
Ramachandran plot of backbone angles (ϕ,ψ)
• favored regions
• generously allowed regions
• disallowed regions
• Amino acids with special properties:
– PRO: ϕ ≈ −60º
– GLY
Similar plots for χ-angle distributions
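
Both the backbone angles (ϕ,ψ) and the side chain χ angles are dihedral angles defined by four consecutive atoms. A minimal sketch of the standard calculation from atomic coordinates follows; the helper names are chosen for illustration, and the four points in the example are made up rather than taken from a real structure.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees defined by four points (range -180 to 180)."""
    b1 = _sub(p1, p0)
    b2 = _sub(p2, p1)
    b3 = _sub(p3, p2)
    n2 = _cross(b2, b3)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(_cross(b1, b2), n2)
    return math.degrees(math.atan2(y, x))

# Made-up geometry: the central bond lies along z, the outer atoms are
# rotated by a quarter turn relative to each other.
print(dihedral((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 1.0, 1.0)))
```

For ϕ the four atoms are C of the preceding residue, N, Cα and C; for ψ they are N, Cα, C and N of the following residue.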
⇒ Useful to identify regions with errors in geometry

1D - 3D Checks
Probability for a feature to occur in a given environment, e.g.
• Solvent exposed / buried
• Hydrophobic / polar environment
• Electrostatic interactions
• Secondary structure
See: R. Luthy (1992) Assessment of protein models with three-dimensional profiles, Nature, 356(6364):83-5

Statistical Mean Force Potentials
[Figure: Atomic non-local interaction energy; panels A and B with the labeled residues Val13 (I), Met80 (*), Ile86 (+), Phe134 (II) and Ala182 (III)]

Atom Type Definitions

Statistical Mean Force Potentials
Use the inverse Boltzmann law to derive an atomic Potential of Mean Force (U) from the observed number of atomic pairs (i,j) within a distance shell r±Δr in the training database of protein structures:
$$
U(i,j,r) = -RT \ln \frac{N_{\text{observed}}(i,j,r)}{N_{\text{expected}}(i,j,r)}
$$
R: gas constant
T: temperature
Nexpected is the expected number of atomic pairs (i,j) in the same distance shell if there were no interactions between atoms (reference state).
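
The formula translates directly into code. In the sketch below the pair counts are invented for illustration, the temperature of 298.15 K is an assumption, and the gas constant is given in kcal/(mol K) so that the result is in kcal/mol.

```python
import math

GAS_CONSTANT = 1.987e-3  # kcal/(mol K)

def mean_force_potential(n_observed, n_expected, temperature=298.15):
    """Inverse Boltzmann potential U = -RT ln(N_observed / N_expected) in kcal/mol."""
    if n_observed <= 0 or n_expected <= 0:
        raise ValueError("pair counts must be positive")
    return -GAS_CONSTANT * temperature * math.log(n_observed / n_expected)

# Invented counts for one atom pair type in one distance shell: the pair is
# seen twice as often as expected without interactions, so U is negative.
print(f"{mean_force_potential(200, 100):.3f} kcal/mol")
```

Pairs observed more often than in the reference state get a favorable (negative) energy, pairs observed less often an unfavorable (positive) one.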
[Figure: mean force potential (MFP, kcal/mol) versus distance (Å) for methyl-methyl pairs and for cysteine S-S pairs]

ANOLEA: (Atomic Non-Local Environment Assessment)
http://protein.bio.puc.cl/cardex/servers/anolea/
http://swissmodel.expasy.org/anolea/

ANOLEA
[Figure: ANOLEA profiles for the correct structure (PDB: 1GES) and for a model with wrong alignment]
⇒ Detects local packing errors
⇒ Errors in alignments

PROCHECK
Checks the stereochemical quality of a protein structure, producing a number of plots analyzing its overall and residue-by-residue geometry.
• Covalent geometry
• Planarity
• Dihedral angles
• Chirality
• Non-bonded interactions
• Main-chain hydrogen bonds
• Disulphide bonds
• Stereochemical parameters
• Residue-by-residue analysis
Laskowski R A, MacArthur M W, Moss D S & Thornton J M (1993). PROCHECK: a program to check the stereochemical quality of protein structures. J. Appl. Cryst., 26, 283-291.
Morris A L, MacArthur M W, Hutchinson E G & Thornton J M (1992). Stereochemical quality of protein structure coordinates. Proteins, 12, 345-364.

WhatCheck / WhatIf
WHAT IF I check my structure?
Imagine ...
• An everyday situation in a biocomputing lab: "Should they use the structure?"
• An everyday situation in a crystallography lab: "Should they deposit the structure already?"
In a WHAT_CHECK report, each reported fact has an assigned severity:
error: severe errors encountered during the analyses. Items marked as errors are considered severe problems requiring immediate attention.
warning: Either less severe problems or uncommon structural features. These still need special attention.
note: Statistical values, plots, or other verbose results of tests and analyses that have been performed.
WHAT IF: A molecular modeling and drug design program. G. Vriend, J. Mol. Graph. (1990) 8, 52-56.
Errors in protein structures. R.W.W. Hooft, G. Vriend, C. Sander, E.E. Abola, Nature (1996) 381, 272-272.

WhatCheck / WhatIf report for a bad model ...
# 49 # Note: Summary report for users of a structure
This is an overall summary of the quality of the structure as compared with current reliable structures. This summary is most useful for biologists seeking a good structure to use for modelling calculations.

The second part of the table mostly gives an impression of how well the model conforms to common refinement constraint values. The first part of the table shows a number of constraint-independent quality indicators.

Structure Z-scores, positive is better than average:
  1st generation packing quality : -2.550
  2nd generation packing quality : -5.472 (bad)
  Ramachandran plot appearance   : -1.898
  chi-1/chi-2 rotamer normality  : -1.433
  Backbone conformation          : -2.173

RMS Z-scores, should be close to 1.0:
  Bond lengths                   : 0.905
  Bond angles                    : 1.476
  Omega angle restraints         : 0.921
  Side chain planarity           : 2.681 (loose)
  Improper dihedral distribution : 1.771 (loose)
  Inside/Outside distribution    : 1.333 (unusual)
(whatcheck.txt)

All checking tools are happy, so can I believe it now?
Models are not experimental facts !
Models can be partially inaccurate or sometimes completely wrong !
A model is a tool that helps to interpret biochemical data.

Some useful Evaluation Tools
ANOLEA : (Atomic Non-Local Environment Assessment)
• http://protein.bio.puc.cl/cardex/servers/anolea/ • http://swissmodel.expasy.org/anolea/
ProCheck
• http://www.biochem.ucl.ac.uk/~roman/procheck/procheck.html
WhatCheck
• http://www.cmbi.kun.nl/gv/whatcheck/
Verify3D
• http://www.doe-mbi.ucla.edu/Services/Verify_3D/
Biotech Validation Suite for Protein Structures
• http://biotech.ebi.ac.uk:8400/

What can models be used for?
“A Model must be wrong, in some respects, else it would be the thing itself. The trick is to see where it is right.”
(Henry A. Bent)

Model quality vs. sequence identity
[Figure: regions labeled Midnight Zone, Twilight Zone and Safe Zone]

What can models be used for?
Annotation by fold assignment
3D-motif searching, active site recognition
Including NMR restraints
Supporting site directed mutagenesis
X-Ray Molecular replacement models
Docking of small molecules
Drug development; comparable to medium resolution NMR or low resolution X-ray structures

Application example: Understanding drug interactions
Knowledge of the 3-dimensional structures of target proteins makes it possible to understand the interactions of inhibitors and drugs with their target proteins.

Discovery of CK2a Inhibitors by in silico docking
Homology model of the target molecule:
Reference:
Discovery of a potent and selective protein kinase CK2 inhibitor by high-throughput docking. Vangrevelinghe E, Zimmermann K, Schoepfer J, Portmann R, Fabbro D, Furet P. Oncology Research, Novartis Pharma, Basle, J Med Chem. 2003 Jun 19;46(13):2656-62.

Medicines are not Effective in all Patients
Inter-individual differences in drug efficacy:
Group             Incomplete/absent efficacy
SSRI              10-25%
ACE-I             10-30%
Beta blockers     15-25%
Statins           30-70%
Beta2 agonists    40-70%
[ Spear BB (2001) Trends Mol Med;7(5):201-204 ]

Structural analysis of human mutations and nsSNPs
[Figure: positions labeled 1 to 8; electrostatic potential scale from -8 to +8 kT/e]
E.g. Changes in the electrostatic properties upon mutation

Public database holdings
[Figure: Number of entries in TrEMBL, SwissProt and PDB on a logarithmic scale (100 to 1'000'000), 1986 to 2004]

Structural Genomics
• large scale experimental structure solution projects
Goal: Most of the sequences in a genome database should match at least one structure with a sufficient sequence identity allowing for reliable modeling.
The modeling error determines selection of targets for structural genomics.
Range of sequence space that can be modeled with acceptable accuracy.

Structural Genomics – Target Selection

Protein Modeling Resources
SWISS-MODEL http://swissmodel.expasy.org
Modeller http://www.salilab.org
WhatIf http://www.cmbi.kun.nl/whatif/
3D-JIGSAW http://www.bmm.icnet.uk/people/paulb/3dj/form.html
CPHmodels http://www.cbs.dtu.dk/services/CPHmodels/
SDSC1 http://cl.sdsc.edu/hm.html