Multiple Sequence Alignment

Bioinformatics Algorithms: Multiple Sequence Alignment
David Hoksza
http://siret.ms.mff.cuni.cz/hoksza

Outline
• Motivation
• Scoring functions
• Algorithms
  • exhaustive
    • multidimensional dynamic programming
  • heuristics
    • progressive alignment
    • iterative alignment/refinement
    • block(local)-based alignment

Multiple sequence alignment (MSA)
• The goal of MSA is to find an “optimal” mapping of a set of sequences
• Homologous residues (originating in the same position in a common ancestor) among a set of sequences are aligned together in columns
• Usually employs multiple pairwise alignment (PA) computations to reveal the evolutionarily equivalent positions across all sequences

Motivation
• Distant homologues
  • faint similarity can become apparent when present in many sequences
  • motifs might not be apparent from pairwise alignment only
• Detection of key functional residues
  • amino acids critical for function tend to be conserved during evolution and can therefore be revealed by inspecting sequences within a given family
• Prediction of secondary/tertiary structure
• Inferring evolutionary history

Representation of MSA
• Column-based representation
• Profile representation (position specific scoring matrix)
• Sequence logo

Manual MSA
• A high-quality MSA can be built by hand using expert knowledge of
  • specific columns
  • highly conserved residues
  • buried hydrophobic residues
  • secondary structure (especially in RNA alignment)
  • expected patterns of insertions and deletions
• Tedious, but
  • a high-quality source of family information
  • a benchmark for evaluation of automatic MSA algorithms
• Databases
  • BAliBASE
    • https://lbgi.fr/balibase/
  • PROSITE
    • http://prosite.expasy.org/
  • Pfam
    • http://pfam.sanger.ac.uk/
  • TIGRFAM
    • http://www.jcvi.org/cgi-bin/tigrfams/index.cgi
  • … (some databases are semi-automatic and many of the databases construct the MSA from the structure information)

Scoring
• How to score an MSA?
  $S(A) = G + \sum_i CS(A_i)$
  • $A_i$ … $i$-th column
  • $CS(A_i)$ … score of the $i$-th column
  • $G$ … gap function (assumes linear or constant gap penalty)
  • the score assumes independent columns
• Two score types are usually considered
  • minimum entropy (ME)
  • sum of pairs (SP)

Minimum entropy (1)
• ME aims to minimize the entropy of each column
  • columns with low entropy (expressible with only a few bits) are good for the alignment
  • the more bits we need to express a column, the more diverse the column is

Minimum entropy (2)
• Probability of a column
  • assumes independence between columns and between residues within a column
  $P(A_i) = \prod_a p_{ia}^{c_{ia}}$
  • $c_{ia}$ … observed count of residue $a$ in the $i$-th column: $c_{ia} = \sum_j \begin{cases} 1 & A_i[j] = a \\ 0 & A_i[j] \ne a \end{cases}$
  • $A_i[j]$ … $j$-th symbol in the $i$-th column
  • $p_{ia}$ … probability of residue $a$ in column $i$
  $C_{ME}(A_i) = -\sum_a c_{ia} \log p_{ia}$
  $ME = \sum_i C_{ME}(A_i)$
• a completely conserved column would score 0

Sum of pairs
• Sum of scores of all possible pairs in a multiple alignment $A$ for a particular scoring matrix
• The score of each column is computed as the sum over all pairs of positions in that column
• Column scores are then summed to get the SP-score
  $SP(A) = \sum_{i=1}^{|A|} C_{SP}(A_i) = \sum_{i=1}^{|A|} \sum_{k<l} \sigma(A_i[k], A_i[l])$
  • $A_i[k]$ … $k$-th symbol in the $i$-th column
  • $\sigma(x, y)$ … PAM or BLOSUM value for the residues $x$ and $y$

SP - Example
• BLOSUM 62 scoring matrix

  G  K  N
  T  R  N
  S  H  E

  column scores: -1  +1  +6, SP score: 6

SP score drawback
• Alignment of $N$ sequences, all containing leucine at a given position for functional reasons
• BLOSUM62 matrix: $\sigma(L, L) = 4 \rightarrow SP(A_i) = 4 \times N(N-1)/2$
• Let us replace one of the leucines with glycine (incorrect alignment): $\sigma(L, G) = -4 \rightarrow$ the score decreases by $8 \times (N-1)$
• $SP(A_i)$ is worse by a fraction of $\frac{8 \times (N-1)}{4 \times N(N-1)/2} = \frac{4}{N}$
• The relative difference in score between the correct and the incorrect alignment decreases with the number of sequences in the alignment
• BUT increasing the number of sequences (evidence) should give us a larger relative difference

Multidimensional dynamic programming (1)
• Generalization of pairwise dynamic programming
• 3 sequences: ATGC, AATC, TTGC

  x coordinate   0 1 1 2 3 4     A - T G C
  y coordinate   0 1 2 3 3 4     A A T - C
  z coordinate   0 0 1 2 3 4     - T T G C

• Resulting path
  • (0,0,0) → (1,1,0) → (1,2,1) → (2,3,2) → (3,3,3) → (4,4,4)

Multidimensional dynamic programming (2)
• Let us assume a linear gap penalty model (not affine)
  • $\gamma(g) = gd$ for a gap of length $g$ and gap cost $d$
• initialization and backtracking are analogous to the 2D case

Multidimensional dynamic programming (3)
[Figure: incoming edges per cell: 3 edges (2 sequences), 7 edges (3 sequences)]

Computational complexity of MDP
• Computation of each cell of the DP matrix takes $2^N - 1$ steps (all possible combinations of gaps in a column)
• Let us assume all the sequences have approximately the same length $L$
• Memory complexity $O(L^N)$
• Time complexity $O(2^N L^N)$

MDP - exercise
• Let's have sequences of length 50
• Comparison of a pair of sequences using DP takes 0.1 s
• What is the time needed to compare 4 sequences?
• Let's say we have 1000 years and the average sequence length is 50.
• How many sequences can we afford to compare?
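
The exercise can be approached with a back-of-the-envelope calculation. The following is a minimal sketch, assuming the running time scales exactly as $(2^N - 1) L^N$ (constant factors are ignored) and calibrating it on the stated 0.1 s for one pairwise comparison. The function names `mdp_seconds` and `max_sequences` are hypothetical and exist only for this illustration.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # ignoring leap years


def mdp_seconds(n_seqs, length=50, pair_seconds=0.1):
    """Estimate the N-dimensional DP time by scaling the pairwise timing.

    Assumption: work is proportional to (2^N - 1) * L^N cell updates.
    """
    def work(n):
        return (2 ** n - 1) * length ** n

    return pair_seconds * work(n_seqs) / work(2)


def max_sequences(budget_seconds, length=50, pair_seconds=0.1):
    """Largest N whose estimated running time still fits into the budget."""
    n = 2
    while mdp_seconds(n + 1, length, pair_seconds) <= budget_seconds:
        n += 1
    return n


time_for_four = mdp_seconds(4)
limit = max_sequences(1000 * SECONDS_PER_YEAR)
print(f"4 sequences: about {time_for_four:.0f} s")
print(f"1000 years allow at most {limit} sequences")
```

Under these assumptions each additional sequence multiplies the running time by roughly $2L = 100$, which is why even a budget of 1000 years allows only a handful of sequences.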

Heuristic Algorithms
• Progressive alignment methods
  • iterative building of the alignment
  • Feng & Doolittle
  • ClustalW, Clustal Omega
• Consistency-based methods
  • T-Coffee
• Iterative refinement
  • alignment built and then refined by realigning the constituent sequences
  • Barton & Sternberg
• Block-based alignment
  • local alignment built by identifying blocks of ungapped MSA, which are then assembled
  • DIALIGN
• Mix of approaches
  • MAFFT, MUSCLE

Progressive alignment
• Framework
  • First, two sequences are aligned using standard pairwise alignment
  • The remaining sequences are taken one by one and aligned to the previous ones
  • Repeated until all sequences are aligned
• Parameters
  • The order in which the sequences are aligned
  • Whether only one alignment is kept and sequences are added to it, or whether an alignment can also be aligned to another alignment (as if a tree was being built)
  • The process used to align and score sequences or alignments against the existing ones

Star alignment
• N sequences $s_1, \dots, s_N$ to be aligned
1. Pick $s_i$ as a starting sequence (the center)
2. Compute all optimal global alignments between $s_i$ and $s_j$, $j \ne i$
3. Successively merge sequences into the arising MSA
  • "once a gap, always a gap" rule
  • if a gap is introduced into the MSA it stays there forever

SA - example (1)
Input sequences:
  S1: ATTGCCATT
  S2: ATGGCCATT
  S3: ATCCAATTTT
  S4: ATCTTCTT
  S5: ATTGCCGATT

Pairwise alignments with S1:
  ATTGCCATT     ATTGCCATT--     ATTGCCATT     ATTGCC-ATT
  ATGGCCATT     ATC-CAATTTT     ATCTTC-TT     ATTGCCGATT

Resulting MSA:
  ATTGCC-ATT--
  ATGGCC-ATT--
  ATTGCCGATT--
  ATCTTC--TT--
  ATC-CA-ATTTT

credit: Xingquan Zhu, Florida Atlantic University

SA - example (2)
  step   pairwise alignment   multiple alignment
  1.     ATTGCCATT            ATTGCCATT
         ATGGCCATT            ATGGCCATT

  2.     ATTGCCATT--          ATTGCCATT--
         ATC-CAATTTT          ATGGCCATT--
                              ATC-CAATTTT

  3.     ATTGCCATT            ATTGCCATT--
         ATCTTC-TT            ATGGCCATT--
                              ATC-CAATTTT
                              ATCTTC-TT--

  4.     ATTGCC-ATT           ATTGCC-ATT--
         ATTGCCGATT           ATGGCC-ATT--
                              ATC-CA-ATTTT
                              ATCTTC--TT--
                              ATTGCCGATT--

SA - choosing the center
• Compute all pairwise alignments and pick the sequence $s_i$ with maximum $\sum_{j \ne i} s(s_i, s_j)$
  • i.e. choose the sequence which is most similar to all the rest
• Alternatively, compute all pairwise alignments, compute the MSA for every $s_i$ as the center, and pick the best

SA - time complexity
• Average sequence length $L$
• One global alignment computation in $O(L^2)$
• $k$ sequences → $O(k^2 L^2)$ for pairwise computations
• $l$ … upper bound on the MSA length → $O(lk)$ for MSA construction
  $O(k^2 L^2 + lk) = O(k^2 L^2)$

SA - exercise
• Compute SP for the constructed MSA
• Compute SA for the previous example but add sequences to the MSA in a different order. Does the order of addition impact the score?
• Compute the MSA starting with S5. Does the score change?

Pairwise alignments with S5:
  ATTGCC-ATT     ATGGCC-ATT     AT--CCAATTTT     AT--CTTCTT
  ATTGCCGATT     ATTGCCGATT     ATTGCCGATT--     ATTGCCGATT

Feng & Doolittle (1)
1. Calculate a distance matrix from all-to-all pairwise alignments ($N(N-1)/2$)
2. Convert raw alignment scores into (evolutionary) distances
  • $D = -\log S_{eff} \times 100 = -\log \frac{S_{obs} - S_{rand}}{S_{max} - S_{rand}} \times 100$
  • $S_{max}(a, b) = \frac{S(a,a) + S(b,b)}{2}$
  • $S_{rand}$ is an expected score obtained by randomization
  • $S_{eff}$ can be viewed as a normalized percentage similarity which decreases roughly exponentially to 0 with increasing evolutionary distance
  • $-\log$ makes the measure linear with evolutionary distance
3. Construct a guide tree from the distance matrix using the Fitch & Margoliash algorithm
4. Align child nodes of each parent (can be sequence-sequence, sequence-MSA, MSA-MSA) in the order they were added to the tree

source: Feng, Da-Fei, and Russell F. Doolittle. "Progressive sequence alignment as a prerequisite to correct phylogenetic trees." Journal of Molecular Evolution 25.4 (1987): 351-360.
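
The score-to-distance conversion in step 2 can be sketched as a small function. This is only an illustration under stated assumptions: the natural logarithm is used (the slides just say "log"), a non-positive $S_{eff}$ is mapped to an infinite distance, and the function name `feng_doolittle_distance` and the numeric inputs are made up for the example rather than taken from the paper.

```python
import math


def feng_doolittle_distance(s_obs, s_aa, s_bb, s_rand):
    """Convert a raw pairwise alignment score into a distance D.

    s_obs  : observed alignment score of sequences a and b
    s_aa   : score of a aligned with itself
    s_bb   : score of b aligned with itself
    s_rand : expected score of randomized (shuffled) sequences
    """
    s_max = (s_aa + s_bb) / 2
    if s_max <= s_rand:
        raise ValueError("S_max must exceed S_rand")
    s_eff = (s_obs - s_rand) / (s_max - s_rand)
    if s_eff <= 0:
        # Assumption: no detectable similarity is treated as infinitely distant.
        return math.inf
    return -math.log(s_eff) * 100


# Hypothetical scores: the observed score lies halfway between S_rand and S_max.
d_half = feng_doolittle_distance(s_obs=60, s_aa=100, s_bb=100, s_rand=20)
d_identical = feng_doolittle_distance(s_obs=100, s_aa=100, s_bb=100, s_rand=20)
print(f"D (half similarity) = {d_half:.2f}")
print(f"D (identical)       = {d_identical:.2f}")
```

Identical sequences give $S_{eff} = 1$ and therefore $D = 0$, while scores approaching the random baseline push $D$ towards infinity.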

Feng & Doolittle (2)
• Sequence-sequence: aligned using classical dynamic programming
• Sequence-MSA: the sequence is aligned with each sequence in the group, and the highest scoring alignment defines how the sequence is added to the group
• MSA-MSA: as in the previous case, but all pairs of sequences are tested
• When a sequence is added to a group, a neutral symbol X is introduced in place of each gap position
  • allows gap positions to be aligned
  • neutral: anything aligned with X scores 0
  • side effect: the gaps in two MSAs tend to come together in the resulting MSA

Profile/MSA Alignment
• When adding a sequence to a group it is desirable to take into account the MSA built so far
  • mismatches at highly conserved positions should be penalized more
• 2 MSAs (profiles) of $N$ sequences in total: one contains sequences $1, \dots, n$, the second $n+1, \dots, N$
  $\sum_i S(A_i) = \sum_i \sum_{k<l\le N} \sigma(A_i[k], A_i[l])$
  $= \sum_i \sum_{k<l\le n} \sigma(A_i[k], A_i[l]) + \sum_i \sum_{n<k<l\le N} \sigma(A_i[k], A_i[l]) + \sum_i \sum_{k\le n,\, n<l\le N} \sigma(A_i[k], A_i[l])$
• The score $\sum_i \sum_{k<l\le N} \sigma(A_i[k], A_i[l])$ consists of the in-group scores plus the between-group scores
• when aligning the profiles we can use standard dynamic programming where columns are aligned against columns using the between-group scores
• → using position-specific information from the group’s multiple alignment

ClustalW
• Similar to Feng & Doolittle but uses profile-based building
1.
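
The split of the column score into in-group and between-group terms from the Profile/MSA Alignment slide can be checked on a single column. A minimal sketch, assuming a toy substitution function `sigma` (+1 match, -1 mismatch, 0 for anything paired with a gap) in place of PAM or BLOSUM values; the helper names `sp_column` and `between_groups` are hypothetical.

```python
from itertools import combinations


def sigma(x, y):
    """Toy substitution score (an assumption, not a PAM/BLOSUM matrix)."""
    if x == "-" or y == "-":
        return 0
    return 1 if x == y else -1


def sp_column(column):
    """Sum-of-pairs score of one column: all pairs k < l."""
    return sum(sigma(x, y) for x, y in combinations(column, 2))


def between_groups(col_a, col_b):
    """Between-group term: every symbol of one profile column against
    every symbol of the other profile column."""
    return sum(sigma(x, y) for x in col_a for y in col_b)


# One column from each of two hypothetical profiles (3 and 2 sequences).
col_a, col_b = "LLG", "L-"
total = sp_column(col_a + col_b)
parts = sp_column(col_a) + sp_column(col_b) + between_groups(col_a, col_b)
print(total, parts)
```

Because the in-group terms do not change when two fixed profiles are aligned to each other, only `between_groups` has to be evaluated inside the dynamic programming over columns.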
