Sequence Alignments

Felix Sappelt, Irina Wagner

Table of Contents

- Pairwise Alignments
  - FASTA
  - BLAST
  - HHSearch
- Multiple Alignments
  - ClustalW
  - MAFFT
  - MUSCLE
  - COBALT
  - T-Coffee
  - 3D-Coffee
  - Jalview

PAIRWISE SEQUENCE ALIGNMENTS

Pairwise Methods

- Dynamic programming
  - Global alignment (Needleman-Wunsch)
  - Local alignment (Smith-Waterman)
- Heuristic methods
  - FASTA
  - BLAST

Heuristic Methods

- Try only the most likely alignments and skip all others
- Much faster than dynamic programming methods, but less sensitive
- For large datasets, such as whole genomes, speed is extremely important
- In some cases, heuristic methods are the only possibility; exact methods take too long

FASTA

- One of the earliest widely used search tools (Lipman & Pearson, 1985)
- Heuristic method approximating Smith-Waterman
- Search time is proportional to the size of the database

FASTA Algorithm

- Find identical substrings (k-tuples) shared by the two sequences
- Re-score and keep only the high-scoring identities
- Discard substrings that cannot easily be joined
- Optimize using dynamic programming in a band around the diagonal
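The first of these steps can be sketched in a few lines of Python. This is a minimal illustration, not the FASTA implementation: the function names `ktup_diagonals` and `best_diagonal`, the default `ktup=2`, and the example sequences are assumptions made for the sketch. It indexes the k-tuples of the query and counts, for each diagonal (target position minus query position), how many identical k-tuples fall on it. Diagonals with many hits are the candidates for re-scoring and joining.

```python
from collections import defaultdict


def ktup_diagonals(query, target, ktup=2):
    """Count identical k-tuples per diagonal (target position - query position)."""
    # Lookup table: each k-tuple of the query -> list of its start positions.
    positions = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        positions[query[i:i + ktup]].append(i)

    # Scan the target once and record the diagonal of every shared k-tuple.
    hits = defaultdict(int)
    for j in range(len(target) - ktup + 1):
        for i in positions.get(target[j:j + ktup], []):
            hits[j - i] += 1
    return dict(hits)


def best_diagonal(query, target, ktup=2):
    """Return (diagonal, hit count) for the diagonal with the most k-tuple hits."""
    hits = ktup_diagonals(query, target, ktup)
    return max(hits.items(), key=lambda kv: kv[1]) if hits else None


print(best_diagonal("GATTACA", "TTGATTACAGG"))  # (2, 6)
```

In the example, six of the seven k-tuple hits lie on diagonal 2, which is where the query matches the target without gaps.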

- PAM (Point Accepted Mutation)
  - Created by Margaret Dayhoff in 1978
  - Based on an explicit evolutionary model
  - PAM1 was estimated from 1572 changes in 71 groups of sequences that were at least 85% similar
  - PAM250 (about 20% identity) is obtained by raising PAM1 to the 250th power
- BLOSUM (Blocks Substitution Matrix)
  - Deals with sequence changes over long timespans
  - Based on multiple protein alignments
  - Used 500 families of related proteins
  - Not based on an explicit evolutionary model, but derived from all amino acid changes observed in aligned regions of related protein families
- When the correct scoring matrix is used, alignment statistics are meaningful

PAM250 Matrix

[Figure: PAM250 substitution matrix]

BLAST

- BLAST is an improvement over FASTA
  - Greater speed by pre-indexing the database
  - More accurate results
- BLAST is the centerpiece of many analyses, because it makes genome-scale sequence data searchable
- The original paper (Altschul et al. 1990) was the most cited paper of the 1990s

- Mask low-complexity regions
- Over 50% of genomic DNA is repetitive:
  - Retrotransposons
  - Repeats
  - Alu regions
  - Microsatellites
  - UTRs
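As a rough illustration of masking, the sketch below replaces every window whose Shannon entropy falls below a cutoff with `N`. This is a stand-in chosen for brevity, not the filter BLAST itself uses (BLAST relies on dedicated programs such as DUST for nucleotides and SEG for proteins). The window size, the entropy cutoff, and the function names are assumptions.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Shannon entropy (in bits) of the letter composition of a window."""
    counts = Counter(window)
    n = len(window)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def mask_low_complexity(seq, window=12, min_entropy=1.0, mask_char="N"):
    """Mask every window whose composition entropy is below min_entropy."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < min_entropy:
            # Overwrite the whole low-entropy window with the mask character.
            for pos in range(start, start + window):
                masked[pos] = mask_char
    return "".join(masked)


print(mask_low_complexity("ACGTACGTACGT" + "A" * 12))
```

Masked positions are excluded from seeding, so a poly-A stretch in the query no longer produces hits against every poly-A stretch in the database.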

- Most widely used program for finding sequence alignments and similarities
- Instead of relying on global alignments, BLAST compares sequences by locating short matches between them
- BLAST creates a list of "words" that reach a certain "threshold" score when compared with the query sequence
- The database is searched for occurrences of these words
- Uses a hash table that contains the neighborhood words

- List all k-tuples (words of length k) in the query sequence
  - The lower the k-tup value, the more background hits you will get
  - The higher the k-tup value, the faster the analysis
- Find all matching words in the database
- Keep only the high-scoring words → difference from FASTA
- Build a search tree from the remaining words
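The word list and its high-scoring neighborhood can be illustrated with a toy example. The sketch assumes a DNA alphabet and a simple match/mismatch score (+2/-1) instead of a real substitution matrix such as BLOSUM62. The function names, the word length `k=3`, and the threshold of 3 are likewise assumptions for illustration only.

```python
from itertools import product


def word_score(a, b, match=2, mismatch=-1):
    """Score two equal-length words position by position (toy scoring scheme)."""
    return sum(match if x == y else mismatch for x, y in zip(a, b))


def query_words(query, k=3):
    """List all overlapping k-letter words of the query."""
    return [query[i:i + k] for i in range(len(query) - k + 1)]


def neighborhood(word, alphabet="ACGT", threshold=3):
    """All words of the same length that score at least `threshold` against `word`."""
    candidates = ("".join(letters) for letters in product(alphabet, repeat=len(word)))
    return {cand for cand in candidates if word_score(word, cand) >= threshold}


print(query_words("ACGTA"))        # ['ACG', 'CGT', 'GTA']
print(len(neighborhood("ACG")))    # 10: the word itself plus 9 single-mismatch words
```

Raising the threshold shrinks the neighborhood, which means fewer seeds and a faster but less sensitive search. Lowering it has the opposite effect.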

- Extend each match in both directions until the score starts to drop (in practice, until it falls a fixed amount below the best score seen so far) or the end of a sequence is reached
- Extended matches are called High-Scoring Segment Pairs (HSPs)
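A minimal sketch of ungapped extension, under stated assumptions: a toy score of +1 per match and -2 per mismatch, and a drop-off parameter `x_drop` that stops the extension once the running score falls more than that amount below the best score seen. The function names and the returned tuple layout are hypothetical and not taken from the BLAST source.

```python
def score_pair(a, b, match=1, mismatch=-2):
    """Toy substitution score for a single aligned pair of letters."""
    return match if a == b else mismatch


def extend_one_way(query, subject, qi, si, step, x_drop):
    """Extend in one direction; return (best score gain, positions kept)."""
    score = best = 0
    length = best_len = 0
    while 0 <= qi < len(query) and 0 <= si < len(subject):
        score += score_pair(query[qi], subject[si])
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > x_drop:
            break  # score fell too far below the best value: stop extending
        qi += step
        si += step
    return best, best_len


def extend_seed(query, subject, q_start, s_start, k, x_drop=3):
    """Extend a k-letter seed to an HSP: (score, query start, subject start, length)."""
    seed = sum(score_pair(query[q_start + i], subject[s_start + i]) for i in range(k))
    right, r_len = extend_one_way(query, subject, q_start + k, s_start + k, 1, x_drop)
    left, l_len = extend_one_way(query, subject, q_start - 1, s_start - 1, -1, x_drop)
    return seed + left + right, q_start - l_len, s_start - l_len, k + l_len + r_len


# Seed "ACG" at position 2 in both sequences grows to the HSP "ACGTA".
print(extend_seed("AAACGTAAA", "CCACGTACC", 2, 2, 3))  # (5, 2, 2, 5)
```

Only the extension up to the best score is kept, so the mismatches that were tried before the drop-off triggered are not part of the reported HSP.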

- List all HSPs in the database whose score is high enough to be considered
- Assess statistical significance via the Gumbel extreme value distribution, which describes the distribution of Smith-Waterman scores between random sequences
- Join HSPs into a longer alignment
- Output

BLAST Results

- Raw score: calculated by summing the scores for each aligned position and the scores for gaps
- Bit score: the raw score converted from the log base of the scoring matrix to log base 2; this rescaling allows scores to be compared between alignments made with different scoring systems
- E-value: expected number of chance alignments with at least this score; the lower the E-value, the more significant the score. The default E-value threshold is 10, but the user can adjust it
- P-value: probability (in the range 0 to 1) of finding at least one alignment with this score or better by chance. It is related to the E-value by P = 1 - e^(-E), so the two are nearly equal for small values

Other BLAST variants

- BLASTN: nucleotide sequence comparison
- BLASTP: general protein comparison
- TBLASTN: compares a protein sequence to a translated DNA database; use it if no homolog is found in the protein database
- TBLASTX: compares a translated DNA sequence to a translated DNA database; used to identify new orthologs among close relatives
- BLASTX: compares a translated nucleotide query to a protein database

- PSI-BLAST (Position-Specific Iterative BLAST)
  - Used to find distant relatives of a protein
  - Easy-to-use version of a "profile" search
  - Uses an iterative alignment procedure to develop position-specific scoring matrices, which increases its capability to detect weak pattern matches
- PHI-BLAST (Pattern-Hit Initiated BLAST)
  - Uses protein motifs to increase the chance of finding biologically significant matches

HHSearch

- Represents query and database sequences by profile hidden Markov models (HMMs)
- Database profiles are derived from multiple sequence alignments
- Before searching the HMM database, an MSA of sequences related to the query is compiled using CSI-BLAST
- From this MSA, a profile is calculated
- The search is then run with this profile as the query

BLAST

- Speed: pre-indexing the database before the search, parallel processing
- Algorithm:
  1. Mask low-complexity regions (repeats)
  2. Make a word list of all k-letter substrings of the query sequence
  3. List the words common to database and query; keep only the high-scoring ones (FASTA keeps all; this is the main difference)
  4. Build an efficient search tree
  5. Repeat steps 3 and 4 for each k-letter substring of the query
  6. Scan the database sequences for exact matches with the remaining high-scoring words
  7. Extend exact matches to High-Scoring Segment Pairs (HSPs): extend the alignment to the left and right until the score gets worse (original BLAST; BLAST2 uses a lower neighborhood score threshold, hence longer words, which saves time; more details on Wikipedia)
  8. List all HSPs in the database whose score is high enough to be considered; use an empirically determined cutoff score S to decide which ones to consider
  9. Evaluate the significance of each HSP score using the Gumbel extreme value distribution (formula on Wikipedia)
  10. Two or more HSPs in one database sequence → combine into one alignment; compare the significance of the newly combined regions using the Poisson method or the sum-of-scores method
  11. Original BLAST: one alignment per HSP (multiple pairwise alignments if more than one HSP is found); BLAST2: one gapped Smith-Waterman alignment
  12. Report matches with an expect score lower than the E threshold

MULTIPLE SEQUENCE ALIGNMENTS

The MSA Problem

- Correctly align more than two sequences
- NP-complete problem
  - For k sequences of length n, the complexity is O(n^k)
  - For 10 sequences of length 50, n^k is about 10^17
  - For 50 sequences of length 500, n^k is about 10^135
- World's biggest supercomputer: 2.5 TFLOPS (10^12)
- Since Planet Earth will be around for just 6 billion years, all current approaches are heuristic.

What are MSAs good for?

- Assess evolutionary history and sequence homology of a set of sequences
- Useful for ...
  - Homology modelling
  - Phylogenetic research
  - Illustrating mutation events and evolutionary processes

MSA Workflow

[Figure: MSA workflow]

Methods

- ClustalW: basic tree-based approach (1994)
- ProbCons: probabilistic approach (2005)
- MAFFT: Fast Fourier Transform (2002)
- MUSCLE: k-substring counting, profiles (2004)
- COBALT: proteins, user input (2007)
- T-Coffee: library-based (2000)

... and many more.

ClustalW

- Published in 1994
- How it works:
  - First, do all possible pairwise alignments
  - Build a guide tree
    - Neighbor-joining method
  - Progressively align according to the branching order in the guide tree
    - Starting from the leaves, build pairwise alignments towards the root
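The guide-tree step can be sketched with a compact neighbor-joining routine. This is an illustrative sketch, not ClustalW's code: it assumes the pairwise distances are already available (for example, one minus the fractional identity of each pairwise alignment), returns only the tree topology as nested tuples, and ignores branch lengths. The function name and the example distance matrix are assumptions. The final join, which effectively chooses a root, is arbitrary in this sketch.

```python
def neighbor_joining(names, dist):
    """Return a guide tree as nested tuples (topology only, no branch lengths)."""
    nodes = list(names)
    d = {(a, b): dist[i][j]
         for i, a in enumerate(names) for j, b in enumerate(names)}
    while len(nodes) > 2:
        n = len(nodes)
        # Net divergence of each node: sum of its distances to all other nodes.
        r = {a: sum(d[a, b] for b in nodes if b != a) for a in nodes}
        pairs = [(a, b) for i, a in enumerate(nodes) for b in nodes[i + 1:]]
        # Join the pair that minimises Q(a, b) = (n - 2) * d(a, b) - r(a) - r(b).
        a, b = min(pairs, key=lambda p: (n - 2) * d[p] - r[p[0]] - r[p[1]])
        new = (a, b)
        # Distance from the new internal node to every remaining node.
        for k in nodes:
            if k != a and k != b:
                d[new, k] = d[k, new] = (d[a, k] + d[b, k] - d[a, b]) / 2
        nodes = [k for k in nodes if k != a and k != b] + [new]
    return tuple(nodes)


names = ["a", "b", "c", "d", "e"]
dist = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
print(neighbor_joining(names, dist))  # (('c', ('a', 'b')), ('d', 'e'))
```

Read from the innermost tuples outwards, the tree gives the progressive alignment order: align a with b, add c to that alignment, align d with e, then merge the two resulting profiles.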

- Pros:
  - It's fast
  - Results are good for highly similar sequences
  - Position-specific gap penalties protect the hydrophobic core
- Cons:
  - Simple approach
  - Errors in the pairwise alignment stage propagate and cannot be corrected

ProbCons

- Probabilistic
- Idea of consistency
  - Prevents misalignments due to "faulty" pairwise alignments
  - Sequences x, y, z: if x_i aligns with y_j, and y_j aligns with z_k, then x_i aligns with z_k
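The consistency idea can be shown with a deliberately simplified, non-probabilistic sketch. It assumes each pairwise alignment is given as a set of (position, position) pairs and checks which pairs of a direct x-z alignment are also implied by going through y. ProbCons itself works with posterior probabilities rather than hard pairs. The function names and the example pairs are hypothetical.

```python
def implied_pairs(xy, yz):
    """Compose x-y and y-z residue pairings into the x-z pairings they imply."""
    y_to_z = dict(yz)
    return {(i, y_to_z[j]) for i, j in xy if j in y_to_z}


def consistency_support(xz_direct, xy, yz):
    """Mark each directly aligned x-z pair as supported (True) or not via y."""
    implied = implied_pairs(xy, yz)
    return {pair: pair in implied for pair in xz_direct}


xy = {(0, 0), (1, 1), (2, 2)}          # x_i aligned with y_j
yz = {(0, 1), (1, 2), (2, 3)}          # y_j aligned with z_k
xz = {(0, 1), (1, 2), (2, 4)}          # direct x-z alignment
print(consistency_support(xz, xy, yz))
```

Here the pairs (0, 1) and (1, 2) are confirmed by the path through y, while (2, 4) is not. A consistency-based method would give the unsupported pair less weight when building the multiple alignment.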

- Compute the posterior probability matrix using an HMM
- Construct pairwise alignments that maximize "expected accuracy"
- Probabilistic consistency transformation of the posterior matrix
  - Incorporates similarity to other sequences into pairwise comparisons
- Build guide tree
- Progressive alignment

MAFFT

- Uses the Fast Fourier Transform to identify homologous regions
- Uses polarity and volume information for amino acids
- Can run in progressive and iterative modes
- Extremely fast
- Very accurate

MUSCLE

- Iterative method
- K-mer counting
  - Approximates the distance between two sequences by the number of common k-substrings
  - Very fast
- Log expectation
  - Profile function used to iteratively improve alignments

COBALT

- Specializes in proteins
- Designed to exploit three strategies:
  - Using biological information by deriving constraints from protein databases
  - Using pairwise similarity present in multiple pairs
  - Allowing the user to specify regions that are to be aligned

T-Coffee

- Tree-based Consistency Objective Function for Alignment Evaluation
- Cédric Notredame, 2000
- Derives constraints from libraries of pairwise alignments
- Slow, but accurate
- 3D-Coffee: extends T-Coffee with structure information from PDB files
- Libraries contain pairwise alignments
- Each amino acid pair in them is a constraint
- Weights: percent identity of the alignments
- Fitting a set of weighted constraints onto an MSA is NP-complete → heuristic solution: library extension

What to use

- For small numbers of sequences (<20) with relatively high identity (>40%), any tool works
- Large numbers of sequences may require fast methods: MAFFT (progressive)
- Low identity (<35%): T-Coffee, ProbCons, MAFFT (L-INS-i)
  - Low-identity alignments generally don't work well
  - Long N- or C-terminal extensions: T-Coffee and MAFFT (E-INS-i)
  - Using structure information may help: 3D-Coffee

MSA Editors

- Jalview: Java Alignment Editor

Sources

- http://de.wikipedia.org/wiki/Substitutionsmatrix
- http://en.wikipedia.org/wiki/BLAST
- William R. Pearson, Rapid and sensitive sequence comparison with FASTP and FASTA, Methods in Enzymology
- Stephen F. Altschul, Warren Gish, Webb Miller, Eugene W. Myers, David J. Lipman, Basic local alignment search tool, Journal of Molecular Biology
- http://en.wikipedia.org/wiki/HHpred_/_HHsearch
- Jimin Pei, Multiple protein sequence alignment, Current Opinion in Structural Biology
- Chuong B. Do, Kazutaka Katoh, Protein Multiple Sequence Alignment, Methods in Molecular Biology
- Thompson et al., CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice, Nucleic Acids Research
- Notredame et al., T-Coffee: A Novel Method for Fast and Accurate Multiple Sequence Alignment, Journal of Molecular Biology
- Robert C. Edgar, MUSCLE: multiple sequence alignment with high accuracy and high throughput, Nucleic Acids Research
- Do et al., ProbCons: Probabilistic Consistency-Based Multiple Sequence Alignment, Genome Research
- Katoh et al., MAFFT Version 5: Improvement in accuracy of multiple sequence alignment, Nucleic Acids Research
- Thompson et al., A Comprehensive Benchmark Study of Multiple Sequence Alignment Methods, PLoS One
- Papadopoulos et al., COBALT: Constraint-based alignment tool for multiple protein sequences, Bioinformatics
- Katoh et al., MAFFT: A novel method for rapid multiple sequence alignment based on fast Fourier transform, Nucleic Acids Research
- Multiple alignments (e.g. ClustalW, ProbCons, MAFFT, MUSCLE, T-Coffee, COBALT) and MSA editors (e.g. Jalview), with special attention to the advantages and limitations of these methods.
- In the multiple alignment part, I would guess that T-Coffee may be new to you. The T-Coffee tutorial (see http://www.tcoffee.org/Documentation/t_coffee/t_coffee_tutorial.htm) also has good pointers on the advantages and disadvantages of the different methods. For your information: next week I will probably set it as "homework" for T-Coffee that everyone works through a tutorial, so that they get to know its practical use. Nevertheless, it certainly makes sense to talk about what happens "behind the scenes" in T-Coffee. The actual task should then only be worked on after your talk (from 17 May).