Lecture 5: Multiple Sequence Alignment
Introduction to Bioinformatics for Computer Scientists
Lecture 5

Advertisement I
● Next round of excellence initiative
● Algorithm Engineering for Scientific Computing
● Joint Master theses with
   – Ralf Reussner
   – Carsten Sinz
   – Dorothea Wagner
   – Anne Koziolek
   – Peter Sanders

Advertisement II
● This summer
● Not only the Bioinformatics seminar
● But also, Algorithmic Methods in Humanities with chair of Dorothea Wagner
   → two places available
   → a lot of bioinformatics methods can be applied to language analysis

Missing Slides
● Taxonomy slides not discussed in lecture 2
   → in next lecture: introduction to phylogenetics

BLAST Index – Solution 1
The scanning phase raised a classic algorithmic problem, i.e. search a long sequence for all occurrences of certain short sequences. We investigated 2 approaches. Simplified, the first works as follows. Suppose that w = 4 and map each word to an integer between 1 and 20^4, so a word can be used as an index into an array of size 20^4 = 160,000. Let the ith entry of such an array point to the list of all occurrences in the query sequence of the ith word. Thus, as we scan the database, each database word leads us immediately to the corresponding hits. Typically, only a few thousand of the 20^4 possible words will be in this table, and it is easy to modify the approach to use far fewer than 20^4 pointers.

BLAST Index – Solution 2
The second approach we explored for the scanning phase was the use of a deterministic finite automaton or finite state machine (Mealy, 1955; Hopcroft & Ullman, 1979). An important feature of our construction was to signal acceptance on transitions (Mealy paradigm) as opposed to on states (Moore paradigm). In the automaton's construction, this saved a factor in space and time roughly proportional to the size of the underlying alphabet. This method yielded a program that ran faster and we prefer this approach for general use. With typical query lengths and parameter settings, this version of BLAST scans a protein database at approximately 500,000 residues/s.

Number of Codons per Residue
● Why certain residues have 6 codons and others fewer has to do with base-pair interactions. For example, in vitro experiments show that the third base for the residue Valine does not have any discriminatory function, suggesting a "two out of three" reading method.
● The number of codons per residue also correlates with its frequency of occurrence in protein molecules. The residues Ser and Leu have 6 codons and are pretty common in proteins.

Viruses and the Tree of Life
“We believe that considering viruses alive or not is not just a matter of opinion, contrary to a commonly held view, but rather is a matter of inference and logic starting from any given definition of life.”
“Although most biologists would argue that viruses are not alive, some argue that viruses should be included in the tree of life.”
“Presently viruses do not find a place on the universal tree of life, which is thus only a tree of cellular life. However, viruses cannot be dismissed as non-living material. We have therefore at least two large DNA sequence spaces, one represented by viruses and another by ribosome-encoding cells. Despite their probable distinct evolutionary origin, both spheres were and are connected by intensive two-way gene transfers.”

Plan for next lectures
● Today: Multiple Sequence Alignment
● Lecture 6: Introduction to phylogenetics
● Lecture 7: Phylogenetic search algorithms
● Lecture 8 (Alexis): The phylogenetic Maximum Likelihood Model
● Lecture 9 (Diego): Phylogenetic Maximum Likelihood Testing & Comparing Models

Multiple Sequence Alignment
● What are we trying to reconstruct?

Insertions, Deletions & Substitutions
[Figure: evolutionary tree with a vertical time axis. The ancestral sequence ATTGCG evolves via the intermediate sequences CTTGCG and ATTGCG into the four present-day sequences ATGCG, ATTGCG, CTTGCG and CTTGCAAG.]
● A → C: substitution (ATTGCG becomes CTTGCG)
● T → VOID: deletion (ATTGCG becomes ATGCG, written AT-GCG)
● VOID → AA: insertion (CTTGCG becomes CTTGCAAG)
● We call this “an indel”, from insertion-deletion
● The indel length here is 2; longer indel lengths are not uncommon!

Aligned data:
   AT-GC--G
   ATTGC--G
   CTTGC--G
   CTTGCAAG
Compute which characters share a common evolutionary history! This is also called: inferring homology

Multiple Sequence Alignment
● So far:
   – Comparing two sequences
   – Mapping a sequence/read to a reference genome
● What do we do when we want to compare more than two sequences at a time?
   – Multiple Sequence Alignment (MSA)
● Open question: how do we assess the quality/accuracy of MSA algorithms?
   → nice review paper: “Who watches the watchmen?” http://arxiv.org/abs/1211.2160

Why do we need MSAs?
● Input for phylogenetic reconstruction
● Discover important (conserved) parts of a protein family
● Protein family → group of evolutionarily related genes/proteins in different species with similar function/structure
● Family has a different meaning than in taxonomy!
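
To make concrete what the aligned data in the indel example encodes, here is a minimal Python sketch. It checks that removing the gaps from each aligned row gives back the original sequence, and it reports the lengths of the gap runs in a row. The helper names (strip_gaps, is_consistent, gap_runs) are hypothetical and chosen only for this illustration. Note that a gap run in a row marks either a deletion in that sequence or an insertion in another one; the alignment alone does not say which.

```python
GAP = "-"

# Sequences and aligned rows copied from the indel example above.
unaligned = ["ATGCG", "ATTGCG", "CTTGCG", "CTTGCAAG"]
aligned = ["AT-GC--G", "ATTGC--G", "CTTGC--G", "CTTGCAAG"]


def strip_gaps(row):
    """Remove all gap characters from one aligned row."""
    return row.replace(GAP, "")


def is_consistent(aligned_rows, original_seqs):
    """True if all rows have equal length and only gaps were inserted."""
    same_length = len({len(row) for row in aligned_rows}) == 1
    same_chars = [strip_gaps(row) for row in aligned_rows] == list(original_seqs)
    return same_length and same_chars


def gap_runs(row):
    """Lengths of the maximal runs of consecutive gaps in one aligned row."""
    runs, current = [], 0
    for ch in row:
        if ch == GAP:
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs


print(is_consistent(aligned, unaligned))
for row in aligned:
    print(row, gap_runs(row))
```
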

MSA
● Generalization of the pair-wise sequence alignment problem
● Given n orthologous sequences s1,...,sn of different lengths, insert gaps “-” such that:
   – All sequences have the same length
   – Some criterion is optimized
   – Corresponding (homologous) characters in si and sj are aligned to each other (in the same alignment column/site)
   – Columns/sites that entirely consist of gaps are not allowed

MSA Terminology
   s1  M Q P I L L L
   s2  M L R - L L -
   s3  M K - I L L L
   s4  M P P V L I L
Each column of this matrix is an alignment site (alignment column).
● Orthologous sequences: sequences in different species that have evolved from the same ancestral gene
   → sequences that share a common evolutionary history
● Homologous characters: characters that share a common evolutionary history
● Note that the characters in a column can be similar (analogous), but this does not automatically induce homology! They could be similar by chance or via convergent evolution (see later slides)

Orthology
[Figure: a gene lineage drawn inside a species tree, with speciation events and a gene duplication marked. Gene copies are labelled orthologous, paralogous and homologous.]

Homology
● High sequence similarity does not automatically induce homology
● Same sequence (gene function) can have evolved independently twice → convergent evolution
● For short sequences: similar by chance

Convergent Evolution
[Figure: parent and offspring lineages illustrating convergent evolution.]

Orthology Assignment
● Numerous methods available
● Will not be covered here → difficult problem
● Let's assume that we have a set of n orthologous sequences s1,...,sn and see how we can align them

Alignment Criteria
● How do we define alignment quality?
● There are different criteria
   – The SP (sum of pairs) measure
   – Real data benchmarks
   – Curated alignments (based on protein structure)
   – Evolutionary measures
   – Simulations

The SP measure
● SP: sum-of-pairs score
● Score each MSA site and then add up the scores over all sites
● Penalize mismatches and gaps
● Favor matches
● The per-site score is defined as the sum of all pairwise scores between characters of a site

SP: an example
● SP-score(I, -, I, V) = p(I, -) + p(I, I) + p(I, V) + p(-, I) + p(-, V) + p(I, V)
● Where p() is the penalty function and p(-, -) := 0
● Given an MSA with n sequences and m sites we can thus compute the overall score as:

   sp = 0;
   for (i = 0; i < m; i++)
      sp += SP_score(sites[i]);   /* SP_score() is the per-site SP-score defined above */

An example
   s1  A A G A A - A
   s2  A T - A A T G
   s3  C T G - G - G
Using the edit distance for p(), the score is: 2 + 2 + 2 + 2 + 2 + 2 + 2 = 14
Note that we can also compute this as the sum of pair-wise edit distances between the aligned sequences: e(s1,s2) + e(s1,s3) + e(s2,s3) = 4 + 5 + 5 = 14
Keep in mind that p(-, -) := 0

The SP measure
● Note that this is only one way to quantify the quality of an alignment
● One can build an alignment algorithm that optimizes the SP measure
● However, alignments (MSAs) with larger (i.e., worse) SP scores may better represent the true evolutionary history of the characters!

How can we extend pair-wise alignment to triple-wise alignment?
● Any ideas?
● What is the time and space complexity?

SP-based optimization
● We can extend the dynamic programming approach for pair-wise sequence alignment to n sequences to calculate an SP-optimal MSA
● Assume that all n sequences have equal length m
● Storing the dynamic programming matrix requires O(m^n) space
● The running time is also at least proportional to m^n, because all m^n entries need to be computed
   → consider an example with n := 3
● As you can imagine, computing the SP-optimal MSA is NP-complete

SP-based MSA
● NP-complete
● It is not guaranteed that SP is the correct (biologically most plausible) criterion!
● Depends on the (arbitrary) choice of the scoring function p()
● We need heuristics!
● We will have a look at some basic heuristics in the following ...
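
The SP computation and the dynamic programming extension discussed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the lecture's reference implementation. It assumes the edit-distance penalty from the worked example (0 for a match, 1 for a mismatch or a character against a gap, and p(-, -) := 0). All function names (p, site_score, sp_score, pairwise_sum, optimal_sp_cost3) are hypothetical. The last function handles only the case n := 3 and returns just the optimal cost, not the alignment itself; its table has (|x|+1)(|y|+1)(|z|+1) cells, which is the m^n growth mentioned above.

```python
from itertools import combinations, product

GAP = "-"


def p(a, b):
    """Assumed edit-distance penalty: 0 if equal (including gap/gap), else 1."""
    return 0 if a == b else 1


def site_score(column):
    """Per-site SP score: sum of p() over all unordered pairs in the column."""
    return sum(p(a, b) for a, b in combinations(column, 2))


def sp_score(msa):
    """Overall SP score: sum of the per-site scores over all m sites."""
    if len({len(row) for row in msa}) != 1:
        raise ValueError("all aligned rows must have the same length")
    return sum(site_score(column) for column in zip(*msa))


def pairwise_sum(msa):
    """Same value, computed as the sum of pair-wise distances between rows."""
    return sum(p(a, b)
               for r1, r2 in combinations(msa, 2)
               for a, b in zip(r1, r2))


def optimal_sp_cost3(x, y, z):
    """Minimum SP cost over all alignments of three sequences (3-D DP)."""
    # Each move consumes a character (1) or places a gap (0) per sequence;
    # the all-gap column (0, 0, 0) is not allowed.
    moves = [m for m in product((0, 1), repeat=3) if any(m)]
    inf = float("inf")
    D = [[[inf] * (len(z) + 1) for _ in range(len(y) + 1)]
         for _ in range(len(x) + 1)]
    D[0][0][0] = 0
    for i in range(len(x) + 1):
        for j in range(len(y) + 1):
            for k in range(len(z) + 1):
                for di, dj, dk in moves:
                    if i < di or j < dj or k < dk:
                        continue  # predecessor cell would be out of range
                    column = (x[i - 1] if di else GAP,
                              y[j - 1] if dj else GAP,
                              z[k - 1] if dk else GAP)
                    cost = D[i - di][j - dj][k - dk] + site_score(column)
                    if cost < D[i][j][k]:
                        D[i][j][k] = cost
    return D[len(x)][len(y)][len(z)]


# The three-sequence example from the slides.
example = ["AAGAA-A", "AT-AATG", "CTG-G-G"]
print(sp_score(example))      # 14
print(pairwise_sum(example))  # 14
print(optimal_sp_cost3("AAGAAA", "ATAATG", "CTGGG"))
```

Since the example alignment is one of the alignments the dynamic program considers, the value returned by optimal_sp_cost3 for the ungapped sequences can never exceed 14. Each cell looks at up to 2^3 - 1 = 7 predecessor cells, which hints at why this approach becomes impractical as n grows.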