Computational Molecular Biology Biochem 218 – BioMedical Informatics 231 http://biochem218.stanford.edu/
Protein Structural Motifs
Doug Brutlag Professor Emeritus Biochemistry & Medicine (by courtesy) Homework 5: Phylogenies
• For this homework assignment take 20 to 30 protein sequences which are at least 30% similar or better and: o 1) make a multiple sequence alignment with them using ClustalW and o 2) make two phylogenies, one using UPGMA method and the other using the Neighbor Joining method • Describe the resulting alignments and include graphic images of the phylogenies in a message to [email protected] • Mention if the trees seem reasonable biologically or taxonomically by comparison with standard taxonomies • Do the two trees have the same topology? • Do the trees have the same branch lengths? • If the two trees do not have the same topology or branch lengths, describe the differences and indicate why you think the two trees differ. Are the differences significant? • Do the trees show evidence of paralogous evolution? Which nodes are orthologous and which are paralogous bifurcations? • Do the trees show evidence of either gene conversion or horizontal gene transfer? Final Projects Due March 12
• Examples of Previous Final Projects o http://biochem218.stanford.edu/Projects.html • Critical review of any area of computational molecular biology. o Area from the lectures but in more depth o Any other area of bioinformatics or genomics focused on computational approaches • Proposed improvement or novel approach • Can be a combined experimental/computational method. • Could be an implementation or just pseudocode. • Please do a MeSH literature search for Reviews on your topic. Some useful MeSH terms include: o Algorithms o Statistics o Molecular Sequence Data o Molecular Structure etc. • Please send a proposed final project topic to [email protected] by next Friday Protein Structure Computational Goals
• Compare all known structures to each other • Compute distances between protein structures • Classify and organize all structures in a biologically meaningful way • Discover conserved substructure domain • Discover conserved substructural motifs • Find common folding patterns and structural/functional motifs • Discover relationship between structure and function. • Study interactions between proteins and other proteins, ligands and DNA (Protein Docking) • Use known structures and folds to infer structure from sequence (Protein Threading) • Use known structural motifs to infer function from structure • Many more… Structural Classification of Proteins (SCOP) http://scop.berkeley.edu/
• Class o Similar secondary structure content o All α, all β, alternating α/ β etc
• Fold (Architecture) o Major structural similarity o SSE’s in similar arrangement
• Superfamily (Topology) o Probable common ancestry o HMM family membership
• Family o Clear evolutionary relationship o Pairwise sequence similarity > 25% Classes of Protein Structures
• Mainly α • Mainly β α β alternating o Parallel β sheets, β-α-β units
• α β o Anti-parallel β sheets, segregated α and β regions o helices mostly on one side of sheet Classes of Protein Structures
• Others o Multi-domain, membrane and cell surface, small proteins, peptides and fragments, designed proteins Folds / Architectures
• Mainly α • α/ βa nd α+β o Bundle • Closed o Non-Bundle • • Mainly β Barrel o Single sheet • Roll, ... o Roll • Open o Barrel • Sandwich o Clam • Clam, ... o Sandwich o Prism o 4/6/7/8 Propeller o Solenoid The TIM Barrel Fold A Conceptual Problem ... Fold versus Topology
Another example: Globin vs. Colicin PDB Protein Database http://www.rcsb.org/pdb/
• Protein DataBase o Multiple Structure Viewers o Sequence & Structure Comparison Tools o Derived Data . SCOP . CATH . pFAM . Go Terms o Education on Protein Structure o Download Structures and Entire Database PDB Protein Database http://www.rcsb.org/pdb/ PDB Protein Database http://www.rcsb.org/pdb/ PDB Advanced Search for UniProt Entry http://www.rcsb.org/pdb/ PDB Search Results http://www.rcsb.org/pdb/ PDB E. coli Hu Entry http://www.rcsb.org/pdb/explore/explore.do?structureId=2O97 PDB SimpleViewer http://www.rcsb.org/pdb/ PDB Protein Workshop View http://www.rcsb.org/pdb/ PDB Derived Data http://www.rcsb.org/pdb/ Molecule of the Month: Enhanceosome http://www.rcsb.org/pdb/static.do?p=education_discussion/molecule_of_the_month/current_month.html NCBI Structure Database http://www.ncbi.nlm.nih.gov/Structure/
• Macromolecular Structures • Related Structures • View Aligned Structures & Sequences • Cn3D: Downloadable Structure & Sequence Viewer • CDD: Conserved Domain Database o CD-Search: Protein Sequence Queries o CD-TREE: Protein Classification Downloadable Application o CDART: Conserved Domain Architecture Tool • PubChem: Small Molecules and Biological Activity • Biological Systems: BioCyc, KEGG and Reactome Pathways • MMDB: Molecular Modeling Database • CBLAST: BLAST sequence against PDB and Related Structure Database • IBIS: Inferred Biomolecular Interaction Server • VAST Search: Structure Alignment Tool NCBI Structure Database http://www.ncbi.nlm.nih.gov/Structure/ NCBI Structure Database http://www.ncbi.nlm.nih.gov/Structure/ NCBI Cn3D Viewer http://www.ncbi.nlm.nih.gov/Structure/CN3D/cn3d.shtml PyMol PDB Structure Viewer http://www.pymol.org/ Databases of Protein Folds
• SCOP (http://scop.berkeley.edu/) o Structural Classification of Proteins o Class-Fold-Superfamily-Family o Manual assembly by inspection • Superfamily (http://supfam.mrc-lmb.cam.ac.uk/SUPERFAMILY/) o HMM models for each SCOP fold o Fold assignments to all genome ORFs o Assessment of specificity/sensitivity of structure prediction o Search by sequence, genome and keywords • CATH (http://www.biochem.ucl.ac.uk/bsm/cath/) o Class - Architecture - Topology - Homologous Superfamily o Manual classification at Architecture level o Automated topology classification using SSAP (Orengo & Taylor) • FSSP (http://www2.embl-ebi.ac.uk/dali/fssp/) o Fully automated using the DALI algorithm (Holm & Sander) o No internal node annotations o Structural similarity search using DALI SCOP Database of Protein Folds http://scop.berkeley.edu/ SCOP Hierarchy http://scop.berkeley.edu/data/scop.b.html SCOP Alpha and Beta Proteins http://scop.berkeley.edu/data/scop.b.d.html SCOP TIM Barrels http://scop.berkeley.edu/data/scop.b.d.b.html SCOP Thiamin Phosphate Synthase http://scop.berkeley.edu/data/scop.b.d.b.d.A.html SCOP Thiamin Phosphate Synthase Entry http://scop.berkeley.edu/ SuperFamily HMM Fold Library http://supfam.mrc-lmb.cam.ac.uk/SUPERFAMILY/ SuperFamily Major Features http://supfam.mrc-lmb.cam.ac.uk/SUPERFAMILY/ Genome Assignments by Superfamily http://supfam.mrc-lmb.cam.ac.uk/SUPERFAMILY/ Databases of Protein Folds
• SCOP (http://scop.berkeley.edu/) o Structural Classification of Proteins o Class-Fold-Superfamily-Family o Manual assembly by inspection • Superfamily (http://supfam.mrc-lmb.cam.ac.uk/SUPERFAMILY/) o HMM models for each SCOP fold o Fold assignments to all genome ORFs o Assessment of specificity/sensitivity of structure prediction o Search by sequence, genome and keywords • CATH (http://www.biochem.ucl.ac.uk/bsm/cath/) o Class - Architecture - Topology - Homologous Superfamily o Manual classification at Architecture level o Automated topology classification using SSAP (Orengo & Taylor) • FSSP (http://www2.embl-ebi.ac.uk/dali/fssp/) o Fully automated using the DALI algorithm (Holm & Sander) o No internal node annotations o Structural similarity search using DALI CATH Protein Structure Classification http://www.biochem.ucl.ac.uk/bsm/cath/ CATH Protein Structure Hierarchy http://www.biochem.ucl.ac.uk/bsm/cath/ CATH Protein Class Level http://www.biochem.ucl.ac.uk/bsm/cath/ CATH Orthogonal Bundle http://www.biochem.ucl.ac.uk/bsm/cath/ CATH Protein Summary http://www.biochem.ucl.ac.uk/bsm/cath/ CATH Protein Summary http://www.biochem.ucl.ac.uk/bsm/cath/ Databases of Protein Folds
• SCOP (http://scop.berkeley.edu/) o Structural Classification of Proteins o Class-Fold-Superfamily-Family o Manual assembly by inspection • Superfamily (http://supfam.mrc-lmb.cam.ac.uk/SUPERFAMILY/) o HMM models for each SCOP fold o Fold assignments to all genome ORFs o Assessment of specificity/sensitivity of structure prediction o Search by sequence, genome and keywords • CATH (http://www.biochem.ucl.ac.uk/bsm/cath/) o Class - Architecture - Topology - Homologous Superfamily o Manual classification at Architecture level o Automated topology classification using SSAP (Orengo & Taylor) • FSSP (http://www2.embl-ebi.ac.uk/dali/fssp/) o Fully automated using the DALI algorithm (Holm & Sander) o No internal node annotations o Structural similarity search using DALI FSSP Database http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?-page+LibInfo+-lib+FSSP Dali Server http://www.ebi.ac.uk/dali/ DALI Database (Liisa Holm) http://ekhidna.biocenter.helsinki.fi/dali/start Protein Fold Prediction: Swiss Model http://swissmodel.expasy.org/
• Amos Bairoch, Swiss Bioinformatics Institute, SBI • Threading and Template Discovery • Workspace for saving Template Results • Domain Annotation • Structure Assessment • Template Library • Structures & Models • Documentation and Tutorials Protein Fold Prediction: Swiss Model http://swissmodel.expasy.org/ Automatic Protein Fold Prediction http://swissmodel.expasy.org/ Automatic Protein Fold Prediction Results http://swissmodel.expasy.org/ Automatic Protein Fold Prediction Results http://swissmodel.expasy.org/ Automatic Protein Fold Prediction Results http://swissmodel.expasy.org/ Protein Fold Prediction: phyre http://www.sbg.bio.ic.ac.uk/~phyre/
• Michael Sternberg, Structural Bioinformatics Group, Imperial College London • Protein structure prediction on the web: a case study using the Phyre server Kelley LA and Sternberg MJE. Nature Protocols 4, 363 - 371 (2009) • Protein Homology/analogY Recognition Engine Protein Fold Prediction: phyre http://www.sbg.bio.ic.ac.uk/~phyre/ Protein Fold Prediction: phyre http://www.sbg.bio.ic.ac.uk/~phyre/ Protein Fold Prediction: phyre http://www.sbg.bio.ic.ac.uk/~phyre/ Protein Fold Prediction: PsiPred http://bioinf4.cs.ucl.ac.uk:3000/psipred/
• Kevin Bryson and David Jones, University College London • Predicts Secondary Structure of single molecules • Predicts Transmembrane Topology • Three Fold Recognition methods Protein Fold Prediction: PsiPred http://bioinf4.cs.ucl.ac.uk:3000/psipred/ Protein Fold Prediction: PsiPred http://bioinf4.cs.ucl.ac.uk:3000/psipred/ Protein Fold Prediction: Predict Protein http://www.predictprotein.org/
• Burkhard Rost, Columbia • Methods o MaxHom : multiple alignment o PSI-BLAST : iterated profile o searchProSite : functional motifs o SEG : composition-bias o ProDom : domain assignment o PredictNLS : nuclear localisation signal o PHDsec : secondary structure o PHDacc : solvent accessibility o Globe : globularity of proteins o PHDhtm : transmembrane helices o PROFsec : secondary structure o PROFacc : solvent accessibilityCoils : coiled-coil regions o CYSPRED : cysteine bridges o Topits : fold recognition by threading Protein Fold Prediction: Predict Protein http://www.predictprotein.org/ Automating Structure Classification, Fold & Function Detection
• Growth of PDB demands automated techniques for classification and fold detection • Protein Structure Comparison o computing structure similarity based on metrics (distances) o identifying protein function o understanding functional mechanism o identifying structurally conserved regions in the protein o finding binding sites or other functionally important regions of the protein Structure Superposition
• Find the transformation matrix that best overlaps the table and the chair • i.e. Find the transformation matrix that minimizes the root mean square deviation between corresponding points of the table and the chair • Correspondences: o Top of chair to top of table o Front of chair to front of table, etc. Absolute Orientation Algorithm http://www-mtl.mit.edu/researchgroups/itrc/ITRC_publication/horn_publications.html
Closed-form solution of absolute orientation using unit quaternions
Berthold K.P. Horn, J.Opt.Soc.Am, + April 1987, Vol 4, No. 4
The key is finding corresponding points between the two structures Algorithms for Structure Superposition
• Distance based methods: DALI (Holm & Sander): Aligning scalar distance plots STRUCTAL (Gerstein & Levitt): Dynamic programming using pair-wise inter-molecular distances SSAP (Orengo & Taylor): Dynamic programming using intra-molecular vector distances MINAREA (Falicov and Cohen): Minimizing soap-bubble surface area CE (Shindyalov & Bourne) • Vector based methods: VAST (Bryant): Graph theory based secondary structure alignment 3D Search (Singh and Brutlag) & 3D Lookup (Holm and Sander): Fast secondary structure index lookup • Both LOCK (Singh & Brutlag) LOCK2 (Ebert & Brutlag): Hierarchically uses both secondary structure vectors and atomic distances DALI
An intra-molecular distance plot for myoglobin DALI
• Based on aligning 2-D intra-molecular distance matrices • Computes the best subset of corresponding residues from the two proteins such that the similarity between the 2-D distance matrices is maximized • Searches through all possible alignments of residues using Monte-Carlo and Branch-and- Bound algorithms
Score(i, j) = 1.5 - |distanceA(i, j) - distanceB(i, j)| STRUCTAL
• Based on Iterative Dynamic Programming to align inter-molecular distances • Pair-wise alignment score in each square of the matrix is inversely proportional to distance between the two atoms
12 3 4 5 6 1 2 3 4 5 6 1 1 2 2 3 3 4 4 5 5 6 6 VAST - Vector Alignment Search Tool
• Aligns only secondary structure elements (SSE)
• Represents each SSE as a vector
• Finds all possible pairs of vectors from the two structures that are similar
• Uses a graph theory algorithm to find maximal subset of similar vector pairs
• Overall alignment score is based on the number of similar pairs of vectors between the two structures Algorithms for Structure Superposition • Atomic distance based methods: DALI (Holm and Sander): Aligning scalar distance plots STRUCTAL (Gerstein and Levitt): Dynamic programming using pair wise inter-molecular distances SSAP (Orengo and Taylor): Dynamic programming using intra-molecular vector distances MINAREA (Falicov and Cohen): Minimizing soap-bubble surface area • Vector based methods: VAST (Bryant): Graph theory based secondary structure alignment 3dSearch (Singh and Brutlag): Fast secondary structure index lookup • Use both SSE vectors and atomic distances LOCK (Singh and Brutlag): Hierarchically uses both secondary structure vectors and atomic distances LOCK - Creating Secondary Structure Vectors Comparing Secondary Structure Vectors
θ Orientation Independent Scores: k i S = S(|angle θ(i,k) - angle φ(p,r)|) S = S(|distance(i,k) - distance(p,r)|) S = S(|length(i) - length(p)|)+ φ S(|length(p) - length(r)|) p r Orientation Dependent Scores: S = S(angle(k,r)) S = S(distance(k,r))
M 2 M M 2 - d S(d) = d 1 + d0 d 0 - M Aligning Secondary Structure Vectors
H H S S S Best local alignment : H HHSS S SHSSH S H Three Step Algorithm
• Local Secondary Structure Superposition o Find an initial superposition of the two proteins by using dynamic programming to align the secondary structure vectors
• Atomic Superposition o Apply a greedy nearest neighbor method to minimize the RMSD between the C-α atoms from query and the target (i.e. find the nearest local minimum in the alignment space)
• Core Superposition o Find the best sequential core of aligned C-α atoms and minimize the RMSD between them Step 1: Local Secondary Structure Superposition
H3 S4 S4 S2 H1 H3 H1 S2 Step 1: Local Secondary Structure Superposition
B3 A4 B4 A2 A1 B1 A3 B2
pair # of aligned vectors total alignment score
A1,A2 2 32 B2,B3 A3,A4 3 71 B3,B4 Step 1: Local Secondary Structure Superposition Step 2: Atomic Superposition Step 3: Core Superposition LOCK 2: Secondary Structure Element Alignment
θ
φ ψ Ф d
Superimpose vectors and θ Compare internal distances in RsRceeopsrreteo sareli ngstne smceoecnnodtn audrsayi rnsygt r ubcottuhr e ψ order to find equivalent φ soetrrlieuemcntteuanrteito renel epinmrdeesnpetensnt aadtseio nvnte acntodr s secondary structure elements d orientation dependent scores Residue Alignment
EEKSAVTALWGKV-- GDKKAINKIWPKIYK
superposition residue registration
• Naïve approach: Nearest neighbor alpha carbons Beta Carbons Encode Directional Information
θ = Angle between Cα and
Cβ vectors d = distance between Cβ atom (maximum 6Ǻ) New Residue Alignment
Cβ Improvements in Consistency • Consistency: measures the adherence to the transitivity property among all triples of protein structures in a given superfamily
Globin Immunoglobulin Superfa Superfamily mily
Alpha carbon 74.3% 58.6% distances
Beta carbon 80% 59.9% positions
% increase in 37.0% 77.8% aligned residues
(less than 10% pairwise sequence identity) New LOCK 2 Properties
• Changes to secondary structure element alignment phase allow for recognition of more distant structural relationships • Metric scoring function: 1-score(A,B) + 1-score(B,C) ≤ 1-score(A,C) • Biologically relevant residue alignment • Highly consistent alignments • Symmetric • Assessment of statistical significance FoldMiner: Structure Similarity Search Based on LOCK2 Alignment
• FoldMiner aligns query structure with all database structures using LOCK2
• FoldMiner up weights secondary structure elements in query that are aligned more often
• FoldMiner outperforms CE and VAST is searches for structure similarity Receiver-Operating Characteristic (ROC) Curves
Ra Fold 16 14 nk 12 10 1 Immunoglobuli 8 6 n Po4sitives Numb2er of True 2 Immunoglobuli 0 . .
0 5 10 15 . .
. . n Number of False Positives 3 p53 • Gold standard: Structural Classification of Proteins (SCOP) o SCOP folds: similar arrangement and connectivity of secondary structure elements Comparing ROC Curves
s 40 e v i 35 t • Area under the ROC i s 30 o curve correlates with P 25 e u 20 the property of ranking r
T 15
CE f true positives ahead of o 10
VAST r e 5 FoldMiner false positives b 0 m
u • 0 100 200 300 Curves may terminate N 25 at different numbers of
20 true and false positives
15 • Areas can only be
10 directly compared if VAST calculated at points 5 CE FoldMiner where the two curves 0 0 5 10 15 20 cross over one another Number of False Positives Comprehensive Analysis of ROC Curves Motif Alignment Results
Families Superfamilies eMOTIFs 96.4% 91.6%
Prosite patterns 97.4% 92.6% LOCK2 Superposition Web Site http://brutlag.stanford.edu/lock2/ LOCK2 Superposition Web Site http://brutlag.stanford.edu/lock2/ PyMol Display of LOCK2 Superposition FoldMiner Structure Search http://brutlag.stanford.edu/foldminer/ FoldMiner Myoglobin Structure Search http://brutlag.stanford.edu/foldminer/ FoldMiner Myoglobin Structure Search http://brutlag.stanford.edu/foldminer/ FoldMiner Myoglobin Structure Search http://brutlag.stanford.edu/foldminer/ FoldMiner Myoglobin Structure Search http://brutlag.stanford.edu/foldminer/ ModLink+ http://sbi.imim.es/modlink/