Sequence Alignments
Felix Sappelt, Irina Wagner

Table of Contents

Pairwise Alignments
- FASTA
- BLAST
- HHSEARCH
Multiple Alignments
- ClustalW
- MAFFT
- Muscle
- Cobalt
- T-Coffee
- 3D-Coffee
- JalView

PAIRWISE SEQUENCE ALIGNMENTS

Pairwise Sequence Alignment Methods
Dynamic Programming
- Global alignment (Needleman-Wunsch)
- Local alignment (Smith-Waterman)
Heuristic Methods
- FASTA
- BLAST

Heuristic Methods
- Try only the most likely alignments and skip all others
- Much faster than dynamic programming methods, but less sensitive
- For large databases, such as whole genomes, speed is extremely important
- In some cases, heuristic methods are the only possibility; exact algorithms take too long

FASTA
- One of the earliest widely used database search tools (Lipman & Pearson, 1985)
- Heuristic method approximating Smith-Waterman
- Search time is proportional to the size of the database

FASTA Algorithm
- Find identical substrings
- Re-score and keep only high-scoring identities
- Discard substrings that cannot be easily joined
- Optimize using dynamic programming around the diagonal

Substitution Matrix
PAM (Point Accepted Mutation)
- Created by Margaret Dayhoff in 1978
- Based on an explicit evolutionary model
- PAM1 was estimated using 1572 changes in 71 groups of protein sequences that were at least 85% similar
- PAM250 (about 20% sequence identity) is obtained by raising PAM1 to the 250th power
BLOSUM (Blocks Substitution Matrix)
- Deals with sequence changes over long timespans
- Based on multiple protein alignments
- Used 500 families of related proteins
- Not based on an explicit evolutionary model, but derived from all amino acid changes observed in an aligned region from a related family of proteins
When the correct scoring matrix is used, alignment statistics are meaningful

PAM250 Matrix
[Figure: PAM250 matrix]

BLAST
- BLAST is an improvement over FASTA
  - Greater speed by pre-indexing the database
  - More accurate results
- BLAST is the centerpiece of many bioinformatics assays, because it makes genome-scale sequences accessible
- The original paper
was the most cited paper of the 1990s (Altschul et al. 1990)

BLAST
- Mask low-complexity regions
- Over 50% of genomic DNA is repetitive
  - Retrotransposons
  - Repeats
  - Alu regions
  - Microsatellites
  - UTRs

BLAST
- Most widely used program for finding sequence alignments and similarities
- Instead of relying on global alignments, BLAST compares two sequences by locating short matches between them
- BLAST creates a list of "words" that reach a certain "threshold" score when compared with the query sequence
- The database is searched for occurrences of these words
- Uses a hash table that contains the neighborhood words

BLAST
- List all k-tuples in the query sequence
  - The lower the k-tup value, the more background you will have
  - The higher the k-tup value, the faster the analysis
- Find all matching words in the database
- Keep only the high-scoring words -> difference to FASTA
- Build a search tree from the remaining words

BLAST
- Extend the match until the match score decreases or the end of the sequence has been reached
- Extended matches are called High-Scoring Segment Pairs (HSPs)

BLAST
- List all HSPs in the database whose score is high enough to be considered
- Assess statistical significance via the Gumbel extreme value distribution, which describes the distribution of Smith-Waterman scores
- Join HSPs into a longer alignment
- Output

BLAST Results
- Raw score: calculated by summing the scores for each aligned position and the scores for gaps
- Bit score: the raw score converted from the log base of the scoring matrix that creates the alignment to log base 2; this rescaling allows scores to be compared between alignments
- E-value: expected number of chance alignments; the lower the E-value, the more significant the score. An expect value of 10.0 is the default significance threshold, but this number can be adjusted by the user
- P-value: the probability (between 0 and 1) that an alignment with this score or better occurs by chance.
It is less accurate than the E-value

Other BLAST variants
- BLASTN: nucleotide sequence comparison
- BLASTP: general protein comparison
- TBLASTN: compares a protein sequence to a translated DNA database
  - Use if no homolog is found in the protein database
- TBLASTX: compares a translated DNA sequence to a translated DNA database
  - Identify new orthologs in closely related species
- BLASTX: compares a translated nucleotide query to a protein database

Other BLAST variants
PSI-BLAST (Position-Specific Iterative BLAST)
- Is used to find distant relatives of a protein
- Easy-to-use version of a "profile" search
- Uses an iterative alignment procedure to develop position-specific scoring matrices, which increases its capability to detect weak pattern matches
PHI-BLAST (Pattern-Hit Initiated BLAST)
- Uses protein motifs to increase the chance of finding biologically significant matches

HHSearch
- Represents query and database by profile hidden Markov models (HMMs)
- Database profiles are derived from multiple sequence alignments
- Before searching the HMM database, an MSA of related sequences is compiled using CSI-BLAST
- From this MSA, a profile is calculated
- The search is run with this profile as the query

BLAST Algorithm
Speed: pre-indexing the database before the search, parallel processing
- Mask low-complexity regions (repeats)
- Make a word list of k-letter substrings of the query sequence
- List the words common to database and query; keep only the high-scoring ones (FASTA keeps all; this is the main difference)
- Build an efficient search tree
- Repeat the two preceding steps for each k-letter substring of the query
- Scan the database sequences for exact matches with the remaining high-scoring words
- Extend exact matches to High-Scoring Segment Pairs (HSPs): extend the alignment to the left and right until the score gets worse (BLAST; BLAST2: lower neighborhood score threshold, which makes the words longer and saves time; there is more to this, see the wiki)
- List all HSPs in the database whose score is high enough to be considered; use a cutoff score S (empirically determined) to decide which ones to consider
- Evaluate HSP score significance
using the Gumbel extreme value distribution (formula in the wiki)
- Two or more HSPs in one database sequence -> combine them into one alignment; compare the significance of the newly combined regions using the Poisson method or the sum-of-scores method
- Original BLAST: one alignment per HSP (multiple pairwise alignments if more than one HSP is found); BLAST2: one gapped Smith-Waterman alignment
- Report matches with an expect score lower than the E threshold

MULTIPLE SEQUENCE ALIGNMENTS

The MSA problem
- Correctly align more than two sequences
- NP-complete problem
  - For k sequences of length n, complexity is O(n^k)
  - For 10 sequences of length 50, n^k is about 10^17
  - For 50 sequences of length 500, n^k is about 10^135
- World's biggest supercomputer: 2.5 TFLOPS (10^12)
- Since Planet Earth will be around for just 6 billion years, all current approaches are heuristic.

What are MSAs good for?
- Assess evolutionary history and sequence homology of a set of sequences
- Useful for ...
  - Homology modelling
  - Phylogenetic research
  - Illustrating mutation events and evolutionary processes

MSA Workflow

Methods
- ClustalW: basic tree-based approach (1994)
- ProbCons: probabilistic approach (2005)
- MAFFT: Fast Fourier Transform (2002)
- Muscle: k-substring counting, profiles (2004)
- Cobalt: proteins, user input (2007)
- T-Coffee: library-based (2000)
- ... and many more.

ClustalW
- Published in 1994
- How it works:
  - First, do all possible pairwise alignments
  - Build a guide tree (neighbor-joining method)
  - Progressively align according to the branching order in the guide tree: starting from the leaves, build pairwise alignments towards the root

ClustalW
Pros:
- It's fast
- Results are good for highly similar sequences
- Position-specific gap penalties protect the hydrophobic core
Cons:
- Simple approach
- Errors in the pairwise alignment stage propagate and cannot be corrected

Probcons
- Probabilistic
- Idea of consistency
  - Prevents misalignments due to "faulty" pairwise alignments
  - Sequences x, y, z: if x_i aligns with y_j, and y_j aligns with z_k, then x_i aligns with z_k.

Probcons
- Compute the posterior probability matrix using an HMM
- Construct pairwise alignments that maximize "expected accuracy"
- Probabilistic consistency transformation of the posterior matrix
  - Incorporate similarity to other sequences into pairwise comparisons
- Build guide tree
- Progressive alignment

MAFFT
- Uses the Fast Fourier Transform to identify homologous regions
- Uses polarity and volume information for amino acids
- Can run in progressive and iterative mode
- Extremely fast
- Very accurate

Muscle
- Iterative method
- K-mer counting
  - Approximate the distance between two sequences by the number of common k-substrings
  - Very fast
- Log expectation
  - Profile function used to iteratively improve alignments

Cobalt
- Specializes in proteins
- Designed to exploit three strategies:
  - Using biological information by deriving constraints from protein databases
  - Using pairwise similarity present in multiple pairs
  - Allowing the user to specify regions that are to be aligned

T-Coffee
- Tree-based Consistency Objective Function for Alignment Evaluation
- Cédric Notredame, 2000
- Derives constraints from libraries of pairwise alignments
- Slow, but accurate
- 3D-Coffee: extends T-Coffee with structure information from PDB files
- Libraries contain pairwise alignments
- Each amino acid pair in them is a constraint
- Weights: percent identity of the alignments
- Fitting a set of weighted constraints onto an MSA is NP-complete -> heuristic solution: extension

What to use
- For small numbers of sequences (<20) with relatively high identity (>40%), any tool works
- Large numbers of sequences may require fast methods: MAFFT (progressive)
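
Sketch: k-mer counting
The k-mer counting idea listed under Muscle can be illustrated with a few lines of Python. This is only a minimal sketch, not Muscle's actual distance measure: the function names `kmer_counts` and `kmer_distance`, the default k = 3, and the normalization by the number of k-mers in the shorter sequence are all assumptions made for this example.

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping substring of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_distance(a: str, b: str, k: int = 3) -> float:
    """Return 1 - (shared k-mers / k-mers in the shorter sequence).

    0.0 means every k-mer of the shorter sequence also occurs in the
    other one; 1.0 means the two sequences share no k-mer at all.
    The normalization is an illustrative choice, not Muscle's formula.
    """
    shortest = min(len(a), len(b)) - k + 1
    if shortest <= 0:
        raise ValueError("both sequences must be at least k letters long")
    counts_a, counts_b = kmer_counts(a, k), kmer_counts(b, k)
    # Counter intersection keeps the minimum count of each common k-mer.
    shared = sum((counts_a & counts_b).values())
    return 1.0 - shared / shortest


if __name__ == "__main__":
    print(kmer_distance("MKVLAAGIV", "MKVLAAGIV"))  # identical sequences
    print(kmer_distance("MKVLAAGIV", "MKVLSAGIV"))  # one substitution
    print(kmer_distance("MKVLAAGIV", "WWWWWWWWW"))  # nothing in common
```

No alignment is computed here, which is why this kind of estimate is fast: the cost grows with the sequence lengths rather than with their product. A single substitution destroys up to k overlapping k-mers, so the measure drops off quickly for divergent sequences.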