Computational Molecular Biology Biochem 218 – BioMedical Informatics 231 http://biochem218.stanford.edu/

Protein Structural Motifs

Doug Brutlag Professor Emeritus Biochemistry & Medicine (by courtesy) Homework 5: Phylogenies

• For this homework assignment take 20 to 30 protein sequences which are at least 30% similar or better and: o 1) make a multiple sequence alignment with them using ClustalW and o 2) make two phylogenies, one using UPGMA method and the other using the Neighbor Joining method • Describe the resulting alignments and include graphic images of the phylogenies in a message to [email protected] • Mention if the trees seem reasonable biologically or taxonomically by comparison with standard taxonomies • Do the two trees have the same topology? • Do the trees have the same branch lengths? • If the two trees do not have the same topology or branch lengths, describe the differences and indicate why you think the two trees differ. Are the differences significant? • Do the trees show evidence of paralogous evolution? Which nodes are orthologous and which are paralogous bifurcations? • Do the trees show evidence of either gene conversion or horizontal gene transfer? Final Projects Due March 12

• Examples of Previous Final Projects o http://biochem218.stanford.edu/Projects.html • Critical review of any area of computational molecular biology. o Area from the lectures but in more depth o Any other area of or genomics focused on computational approaches • Proposed improvement or novel approach • Can be a combined experimental/computational method. • Could be an implementation or just pseudocode. • Please do a MeSH literature search for Reviews on your topic. Some useful MeSH terms include: o Algorithms o Statistics o Molecular Sequence Data o Molecular Structure etc. • Please send a proposed final project topic to [email protected] by next Friday Computational Goals

• Compare all known structures to each other • Compute distances between protein structures • Classify and organize all structures in a biologically meaningful way • Discover conserved substructure domain • Discover conserved substructural motifs • Find common folding patterns and structural/functional motifs • Discover relationship between structure and function. • Study interactions between proteins and other proteins, ligands and DNA (Protein Docking) • Use known structures and folds to infer structure from sequence (Protein Threading) • Use known structural motifs to infer function from structure • Many more… Structural Classification of Proteins (SCOP) http://scop.berkeley.edu/

• Class o Similar secondary structure content o All α, all β, alternating α/ β etc

• Fold (Architecture) o Major structural similarity o SSE’s in similar arrangement

• Superfamily (Topology) o Probable common ancestry o HMM family membership

• Family o Clear evolutionary relationship o Pairwise sequence similarity > 25% Classes of Protein Structures

• Mainly α • Mainly β  α β alternating o Parallel β sheets, β-α-β units

• α β o Anti-parallel β sheets, segregated α and β regions o helices mostly on one side of sheet Classes of Protein Structures

• Others o Multi-domain, membrane and cell surface, small proteins, peptides and fragments, designed proteins Folds / Architectures

• Mainly α • α/ βa nd α+β o Bundle • Closed o Non-Bundle • • Mainly β Barrel o Single sheet • Roll, ... o Roll • Open o Barrel • Sandwich o Clam • Clam, ... o Sandwich o Prism o 4/6/7/8 Propeller o Solenoid The TIM Barrel Fold A Conceptual Problem ... Fold versus Topology

Another example: Globin vs. Colicin PDB Protein Database http://www.rcsb.org/pdb/

• Protein DataBase o Multiple Structure Viewers o Sequence & Structure Comparison Tools o Derived Data . SCOP . CATH . pFAM . Go Terms o Education on Protein Structure o Download Structures and Entire Database PDB Protein Database http://www.rcsb.org/pdb/ PDB Protein Database http://www.rcsb.org/pdb/ PDB Advanced Search for UniProt Entry http://www.rcsb.org/pdb/ PDB Search Results http://www.rcsb.org/pdb/ PDB E. coli Hu Entry http://www.rcsb.org/pdb/explore/explore.do?structureId=2O97 PDB SimpleViewer http://www.rcsb.org/pdb/ PDB Protein Workshop View http://www.rcsb.org/pdb/ PDB Derived Data http://www.rcsb.org/pdb/ Molecule of the Month: Enhanceosome http://www.rcsb.org/pdb/static.do?p=education_discussion/molecule_of_the_month/current_month.html NCBI Structure Database http://www.ncbi.nlm.nih.gov/Structure/

• Macromolecular Structures • Related Structures • View Aligned Structures & Sequences • Cn3D: Downloadable Structure & Sequence Viewer • CDD: Conserved Domain Database o CD-Search: Protein Sequence Queries o CD-TREE: Protein Classification Downloadable Application o CDART: Conserved Domain Architecture Tool • PubChem: Small Molecules and Biological Activity • Biological Systems: BioCyc, KEGG and Reactome Pathways • MMDB: Molecular Modeling Database • CBLAST: BLAST sequence against PDB and Related Structure Database • IBIS: Inferred Biomolecular Interaction Server • VAST Search: Structure Alignment Tool NCBI Structure Database http://www.ncbi.nlm.nih.gov/Structure/ NCBI Structure Database http://www.ncbi.nlm.nih.gov/Structure/ NCBI Cn3D Viewer http://www.ncbi.nlm.nih.gov/Structure/CN3D/cn3d.shtml PyMol PDB Structure Viewer http://www.pymol.org/ Databases of Protein Folds

• SCOP (http://scop.berkeley.edu/) o Structural Classification of Proteins o Class-Fold-Superfamily-Family o Manual assembly by inspection • Superfamily (http://supfam.mrc-lmb.cam.ac.uk/SUPERFAMILY/) o HMM models for each SCOP fold o Fold assignments to all genome ORFs o Assessment of specificity/sensitivity of structure prediction o Search by sequence, genome and keywords • CATH (http://www.biochem.ucl.ac.uk/bsm/cath/) o Class - Architecture - Topology - Homologous Superfamily o Manual classification at Architecture level o Automated topology classification using SSAP (Orengo & Taylor) • FSSP (http://www2.embl-ebi.ac.uk/dali/fssp/) o Fully automated using the DALI algorithm (Holm & Sander) o No internal node annotations o Structural similarity search using DALI SCOP Database of Protein Folds http://scop.berkeley.edu/ SCOP Hierarchy http://scop.berkeley.edu/data/scop.b.html SCOP Alpha and Beta Proteins http://scop.berkeley.edu/data/scop.b.d.html SCOP TIM Barrels http://scop.berkeley.edu/data/scop.b.d.b.html SCOP Thiamin Phosphate Synthase http://scop.berkeley.edu/data/scop.b.d.b.d.A.html SCOP Thiamin Phosphate Synthase Entry http://scop.berkeley.edu/ SuperFamily HMM Fold Library http://supfam.mrc-lmb.cam.ac.uk/SUPERFAMILY/ SuperFamily Major Features http://supfam.mrc-lmb.cam.ac.uk/SUPERFAMILY/ Genome Assignments by Superfamily http://supfam.mrc-lmb.cam.ac.uk/SUPERFAMILY/ Databases of Protein Folds

• SCOP (http://scop.berkeley.edu/) o Structural Classification of Proteins o Class-Fold-Superfamily-Family o Manual assembly by inspection • Superfamily (http://supfam.mrc-lmb.cam.ac.uk/SUPERFAMILY/) o HMM models for each SCOP fold o Fold assignments to all genome ORFs o Assessment of specificity/sensitivity of structure prediction o Search by sequence, genome and keywords • CATH (http://www.biochem.ucl.ac.uk/bsm/cath/) o Class - Architecture - Topology - Homologous Superfamily o Manual classification at Architecture level o Automated topology classification using SSAP (Orengo & Taylor) • FSSP (http://www2.embl-ebi.ac.uk/dali/fssp/) o Fully automated using the DALI algorithm (Holm & Sander) o No internal node annotations o Structural similarity search using DALI CATH Protein Structure Classification http://www.biochem.ucl.ac.uk/bsm/cath/ CATH Protein Structure Hierarchy http://www.biochem.ucl.ac.uk/bsm/cath/ CATH Protein Class Level http://www.biochem.ucl.ac.uk/bsm/cath/ CATH Orthogonal Bundle http://www.biochem.ucl.ac.uk/bsm/cath/ CATH Protein Summary http://www.biochem.ucl.ac.uk/bsm/cath/ CATH Protein Summary http://www.biochem.ucl.ac.uk/bsm/cath/ Databases of Protein Folds

• SCOP (http://scop.berkeley.edu/) o Structural Classification of Proteins o Class-Fold-Superfamily-Family o Manual assembly by inspection • Superfamily (http://supfam.mrc-lmb.cam.ac.uk/SUPERFAMILY/) o HMM models for each SCOP fold o Fold assignments to all genome ORFs o Assessment of specificity/sensitivity of structure prediction o Search by sequence, genome and keywords • CATH (http://www.biochem.ucl.ac.uk/bsm/cath/) o Class - Architecture - Topology - Homologous Superfamily o Manual classification at Architecture level o Automated topology classification using SSAP (Orengo & Taylor) • FSSP (http://www2.embl-ebi.ac.uk/dali/fssp/) o Fully automated using the DALI algorithm (Holm & Sander) o No internal node annotations o Structural similarity search using DALI FSSP Database http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?-page+LibInfo+-lib+FSSP Dali Server http://www.ebi.ac.uk/dali/ DALI Database (Liisa Holm) http://ekhidna.biocenter.helsinki.fi/dali/start Protein Fold Prediction: Swiss Model http://swissmodel.expasy.org/

• Amos Bairoch, Swiss Bioinformatics Institute, SBI • Threading and Template Discovery • Workspace for saving Template Results • Domain Annotation • Structure Assessment • Template Library • Structures & Models • Documentation and Tutorials Protein Fold Prediction: Swiss Model http://swissmodel.expasy.org/ Automatic Protein Fold Prediction http://swissmodel.expasy.org/ Automatic Protein Fold Prediction Results http://swissmodel.expasy.org/ Automatic Protein Fold Prediction Results http://swissmodel.expasy.org/ Automatic Protein Fold Prediction Results http://swissmodel.expasy.org/ Protein Fold Prediction: http://www.sbg.bio.ic.ac.uk/~phyre/

• Michael Sternberg, Structural Bioinformatics Group, • Protein structure prediction on the web: a case study using the Phyre server Kelley LA and Sternberg MJE. Nature Protocols 4, 363 - 371 (2009) • Protein Homology/analogY Recognition Engine Protein Fold Prediction: phyre http://www.sbg.bio.ic.ac.uk/~phyre/ Protein Fold Prediction: phyre http://www.sbg.bio.ic.ac.uk/~phyre/ Protein Fold Prediction: phyre http://www.sbg.bio.ic.ac.uk/~phyre/ Protein Fold Prediction: PsiPred http://bioinf4.cs.ucl.ac.uk:3000/psipred/

• Kevin Bryson and David Jones, University College London • Predicts Secondary Structure of single molecules • Predicts Transmembrane Topology • Three Fold Recognition methods Protein Fold Prediction: PsiPred http://bioinf4.cs.ucl.ac.uk:3000/psipred/ Protein Fold Prediction: PsiPred http://bioinf4.cs.ucl.ac.uk:3000/psipred/ Protein Fold Prediction: Predict Protein http://www.predictprotein.org/

• Burkhard Rost, Columbia • Methods o MaxHom : multiple alignment o PSI-BLAST : iterated profile o searchProSite : functional motifs o SEG : composition-bias o ProDom : domain assignment o PredictNLS : nuclear localisation signal o PHDsec : secondary structure o PHDacc : solvent accessibility o Globe : globularity of proteins o PHDhtm : transmembrane helices o PROFsec : secondary structure o PROFacc : solvent accessibilityCoils : coiled-coil regions o CYSPRED : cysteine bridges o Topits : fold recognition by threading Protein Fold Prediction: Predict Protein http://www.predictprotein.org/ Automating Structure Classification, Fold & Function Detection

• Growth of PDB demands automated techniques for classification and fold detection • Protein Structure Comparison o computing structure similarity based on metrics (distances) o identifying protein function o understanding functional mechanism o identifying structurally conserved regions in the protein o finding binding sites or other functionally important regions of the protein Structure Superposition

• Find the transformation matrix that best overlaps the table and the chair • i.e. Find the transformation matrix that minimizes the root mean square deviation between corresponding points of the table and the chair • Correspondences: o Top of chair to top of table o Front of chair to front of table, etc. Absolute Orientation Algorithm http://www-mtl.mit.edu/researchgroups/itrc/ITRC_publication/horn_publications.html

Closed-form solution of absolute orientation using unit quaternions

Berthold K.P. Horn, J.Opt.Soc.Am, + April 1987, Vol 4, No. 4

The key is finding corresponding points between the two structures Algorithms for Structure Superposition

• Distance based methods:  DALI (Holm & Sander): Aligning scalar distance plots  STRUCTAL (Gerstein & Levitt): Dynamic programming using pair-wise inter-molecular distances  SSAP (Orengo & Taylor): Dynamic programming using intra-molecular vector distances  MINAREA (Falicov and Cohen): Minimizing soap-bubble surface area  CE (Shindyalov & Bourne) • Vector based methods:  VAST (Bryant): Graph theory based secondary structure alignment  3D Search (Singh and Brutlag) & 3D Lookup (Holm and Sander): Fast secondary structure index lookup • Both  LOCK (Singh & Brutlag) LOCK2 (Ebert & Brutlag): Hierarchically uses both secondary structure vectors and atomic distances DALI

An intra-molecular distance plot for myoglobin DALI

• Based on aligning 2-D intra-molecular distance matrices • Computes the best subset of corresponding residues from the two proteins such that the similarity between the 2-D distance matrices is maximized • Searches through all possible alignments of residues using Monte-Carlo and Branch-and- Bound algorithms

Score(i, j) = 1.5 - |distanceA(i, j) - distanceB(i, j)| STRUCTAL

• Based on Iterative Dynamic Programming to align inter-molecular distances • Pair-wise alignment score in each square of the matrix is inversely proportional to distance between the two atoms

12 3 4 5 6 1 2 3 4 5 6 1 1 2 2 3 3 4 4 5 5 6 6 VAST - Vector Alignment Search Tool

• Aligns only secondary structure elements (SSE)

• Represents each SSE as a vector

• Finds all possible pairs of vectors from the two structures that are similar

• Uses a graph theory algorithm to find maximal subset of similar vector pairs

• Overall alignment score is based on the number of similar pairs of vectors between the two structures Algorithms for Structure Superposition • Atomic distance based methods:  DALI (Holm and Sander): Aligning scalar distance plots  STRUCTAL (Gerstein and Levitt): Dynamic programming using pair wise inter-molecular distances  SSAP (Orengo and Taylor): Dynamic programming using intra-molecular vector distances  MINAREA (Falicov and Cohen): Minimizing soap-bubble surface area • Vector based methods:  VAST (Bryant): Graph theory based secondary structure alignment  3dSearch (Singh and Brutlag): Fast secondary structure index lookup • Use both SSE vectors and atomic distances  LOCK (Singh and Brutlag): Hierarchically uses both secondary structure vectors and atomic distances LOCK - Creating Secondary Structure Vectors Comparing Secondary Structure Vectors

θ Orientation Independent Scores: k i S = S(|angle θ(i,k) - angle φ(p,r)|) S = S(|distance(i,k) - distance(p,r)|) S = S(|length(i) - length(p)|)+ φ S(|length(p) - length(r)|) p r Orientation Dependent Scores: S = S(angle(k,r)) S = S(distance(k,r))

M 2 M M 2 - d S(d) = d 1 + d0 d 0 - M Aligning Secondary Structure Vectors

H H S S S Best local alignment : H HHSS S SHSSH S H Three Step Algorithm

• Local Secondary Structure Superposition o Find an initial superposition of the two proteins by using dynamic programming to align the secondary structure vectors

• Atomic Superposition o Apply a greedy nearest neighbor method to minimize the RMSD between the C-α atoms from query and the target (i.e. find the nearest local minimum in the alignment space)

• Core Superposition o Find the best sequential core of aligned C-α atoms and minimize the RMSD between them Step 1: Local Secondary Structure Superposition

H3 S4 S4 S2 H1 H3 H1 S2 Step 1: Local Secondary Structure Superposition

B3 A4 B4 A2 A1 B1 A3 B2

pair # of aligned vectors total alignment score

A1,A2 2 32 B2,B3 A3,A4 3 71 B3,B4 Step 1: Local Secondary Structure Superposition Step 2: Atomic Superposition Step 3: Core Superposition LOCK 2: Secondary Structure Element Alignment

θ

φ ψ Ф d

Superimpose vectors and θ Compare internal distances in RsRceeopsrreteo sareli ngstne smceoecnnodtn audrsayi rnsygt r ubcottuhr e ψ order to find equivalent φ soetrrlieuemcntteuanrteito renel epinmrdeesnpetensnt aadtseio nvnte acntodr s secondary structure elements d orientation dependent scores Residue Alignment

EEKSAVTALWGKV-- GDKKAINKIWPKIYK

superposition residue registration

• Naïve approach: Nearest neighbor alpha carbons Beta Carbons Encode Directional Information

θ = Angle between Cα and

Cβ vectors d = distance between Cβ atom (maximum 6Ǻ) New Residue Alignment

Cβ Improvements in Consistency • Consistency: measures the adherence to the transitivity property among all triples of protein structures in a given superfamily

Globin Immunoglobulin Superfa Superfamily mily

Alpha carbon 74.3% 58.6% distances

Beta carbon 80% 59.9% positions

% increase in 37.0% 77.8% aligned residues

(less than 10% pairwise sequence identity) New LOCK 2 Properties

• Changes to secondary structure element alignment phase allow for recognition of more distant structural relationships • Metric scoring function: 1-score(A,B) + 1-score(B,C) ≤ 1-score(A,C) • Biologically relevant residue alignment • Highly consistent alignments • Symmetric • Assessment of statistical significance FoldMiner: Structure Similarity Search Based on LOCK2 Alignment

• FoldMiner aligns query structure with all database structures using LOCK2

• FoldMiner up weights secondary structure elements in query that are aligned more often

• FoldMiner outperforms CE and VAST is searches for structure similarity Receiver-Operating Characteristic (ROC) Curves

Ra Fold 16 14 nk 12 10 1 Immunoglobuli 8 6 n Po4sitives Numb2er of True 2 Immunoglobuli 0 . .

0 5 10 15 . .

. . n Number of False Positives 3 p53 • Gold standard: Structural Classification of Proteins (SCOP) o SCOP folds: similar arrangement and connectivity of secondary structure elements Comparing ROC Curves

s 40 e v i 35 t • Area under the ROC i s 30 o curve correlates with P 25 e u 20 the property of ranking r

T 15

CE f true positives ahead of o 10

VAST r e 5 FoldMiner false positives b 0 m

u • 0 100 200 300 Curves may terminate N 25 at different numbers of

20 true and false positives

15 • Areas can only be

10 directly compared if VAST calculated at points 5 CE FoldMiner where the two curves 0 0 5 10 15 20 cross over one another Number of False Positives Comprehensive Analysis of ROC Curves Motif Alignment Results

Families Superfamilies eMOTIFs 96.4% 91.6%

Prosite patterns 97.4% 92.6% LOCK2 Superposition Web Site http://brutlag.stanford.edu/lock2/ LOCK2 Superposition Web Site http://brutlag.stanford.edu/lock2/ PyMol Display of LOCK2 Superposition FoldMiner Structure Search http://brutlag.stanford.edu/foldminer/ FoldMiner Myoglobin Structure Search http://brutlag.stanford.edu/foldminer/ FoldMiner Myoglobin Structure Search http://brutlag.stanford.edu/foldminer/ FoldMiner Myoglobin Structure Search http://brutlag.stanford.edu/foldminer/ FoldMiner Myoglobin Structure Search http://brutlag.stanford.edu/foldminer/ ModLink+ http://sbi.imim.es/modlink/