
Bioinformatics Algorithms

Multiple sequence alignment

David Hoksza
http://siret.ms.mff.cuni.cz/hoksza

Outline

• Motivation
• Scoring functions
• Algorithms
  • exhaustive
    • multidimensional dynamic programming
  • heuristics
    • progressive alignment
    • iterative alignment/refinement
    • block (local)-based alignment

Multiple sequence alignment (MSA)

• Goal of MSA is to find “optimal” mapping of a set of sequences

• Homologous residues (originating in the same position in a common ancestor) among a set of sequences are aligned together in columns

• Usually employs multiple pairwise alignment (PA) computations to reveal the evolutionarily equivalent positions across all sequences

Motivation

• Distant homologues
  • faint similarity can become apparent when present in many sequences
  • motifs might not be apparent from pairwise alignment only

• Detection of key functional residues
  • amino acids critical for function tend to be conserved during evolution and can therefore be revealed by inspecting sequences within a given family

• Prediction of secondary/tertiary structure

• Inferring evolutionary history

Representation of MSA

• Column-based representation

• Profile representation (position specific scoring matrix)

Manual MSA

• High quality MSA can be carried out by hand using expert knowledge
  • specific columns
    • highly conserved residues
    • buried hydrophobic residues
    • secondary structure (especially in RNA alignment)
    • expected patterns of insertions and deletions

• Tedious, but
  • high-quality source of family information
  • a benchmark for evaluation of automatic MSA algorithms
    • BAliBASE
      • https://lbgi.fr/balibase/
    • PROSITE
      • http://prosite.expasy.org/
    • Pfam
      • http://pfam.sanger.ac.uk/
    • TIGRFAM
      • http://www.jcvi.org/cgi-bin/tigrfams/index.cgi
    • … (some databases are semi-automatic, and many of them construct the MSA from structure information)

Scoring

• How to score an MSA?

$S(A) = G + \sum_i CS(A_i)$

• $A_i$ … $i$-th column
• $CS(A_i)$ … score of the $i$-th column
• $G$ … gap function (assumes a linear or constant gap penalty model)

• the score assumes independent columns

• Two score types are usually considered
  • minimum entropy (ME)
  • sum of pairs (SP)

Minimum entropy (1)

• ME aims to minimize the entropy of each column

• columns with low entropy (expressible with only a few bits) are good for the alignment

• the more bits we need to express a column, the more diverse the column is

Minimum entropy (2)

• Probability of a column
  • assumes independence between columns and between residues within a column

$P(A_i) = \prod_a p_{ia}^{c_{ia}}$

• $c_{ia}$ … observed count of residue $a$ in the $i$-th column: $c_{ia} = \sum_j \delta_{ij}(a)$, where $\delta_{ij}(a) = 1$ if $A_i[j] = a$ and $0$ if $A_i[j] \neq a$
• $A_i[j]$ … $j$-th symbol in the $i$-th column
• $p_{ia}$ … probability of residue $a$ in column $i$

$C_{ME}(A_i) = -\sum_a c_{ia} \log p_{ia}$

$ME = \sum_i C_{ME}(A_i)$

• a completely conserved column would score 0

Sum of pairs
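A minimal sketch of the minimum-entropy score from the previous slide. The slides do not say how $p_{ia}$ is estimated, so this sketch assumes the plain observed frequency $p_{ia} = c_{ia}/N$ (no pseudocounts), uses base-2 logarithms so the score reads as bits, and treats the gap character as an ordinary symbol. The function names are illustrative.

```python
import math
from collections import Counter


def column_entropy_score(column):
    """C_ME(A_i) = -sum_a c_ia * log2(p_ia), with p_ia assumed to be c_ia / N."""
    counts = Counter(column)
    n = len(column)
    return -sum(c * math.log2(c / n) for c in counts.values())


def minimum_entropy_score(alignment):
    """ME = sum of column scores; `alignment` is a list of equal-length rows."""
    return sum(column_entropy_score(col) for col in zip(*alignment))


if __name__ == "__main__":
    print(column_entropy_score("LLLL"))   # fully conserved column
    print(column_entropy_score("LLGG"))   # two residues, evenly split
    print(minimum_entropy_score(["AL", "AG"]))
```

With these assumptions a fully conserved column needs zero bits, which matches the last bullet of the slide.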

• Sum of scores of all possible pairs in a multiple alignment $A$ for a particular scoring matrix
• Score for each column is computed as the sum over all pairs of positions in that column
• Column scores are then summed to get the SP-score

$SP(A) = \sum_{i=1}^{|A|} C_{SP}(A_i) = \sum_{i=1}^{|A|} \sum_{k<l} \sigma(A_i[k], A_i[l])$

• $A_i[k]$ … $k$-th symbol in the $i$-th column
• $\sigma(x, y)$ … PAM or BLOSUM value for the residues $x$ and $y$

SP - Example

• BLOSUM 62 scoring matrix

G K N
T R N
S H E

column scores: -1 +1 +6, SP score: 6

SP score drawback

• Alignment of $N$ sequences, all containing leucine at a given position for functional reasons
• BLOSUM62 matrix: $\sigma(L, L) = 4 \rightarrow C_{SP}(A_i) = 4 \times N(N-1)/2$
• Let us replace one of the leucines with glycine (incorrect alignment): $\sigma(L, G) = -4 \rightarrow$ the score decreases by $8 \times (N-1)$
• $C_{SP}(A_i)$ is worse by a fraction of $\frac{8 \times (N-1)}{4 \times N(N-1)/2} = \frac{4}{N}$
• The relative difference in score between the correct and the incorrect alignment decreases with the number of sequences in the alignment
• BUT increasing the number of sequences (evidence) should increase the relative difference

Multidimensional dynamic programming (1)
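A short sketch of the sum-of-pairs score from the previous slides, reproducing the three-sequence example and the leucine/glycine drawback. Only the BLOSUM62 entries needed for these two examples are hard-coded; the dictionary and function names are illustrative, and gaps are not handled here.

```python
from itertools import combinations

# Subset of BLOSUM62, keyed by the alphabetically sorted residue pair.
BLOSUM62_SUBSET = {
    ("G", "T"): -2, ("G", "S"): 0, ("S", "T"): 1,
    ("K", "R"): 2, ("H", "K"): -1, ("H", "R"): 0,
    ("N", "N"): 6, ("E", "N"): 0,
    ("L", "L"): 4, ("G", "L"): -4,
}


def sigma(x, y):
    """Substitution score for residues x and y (symmetric lookup)."""
    return BLOSUM62_SUBSET[tuple(sorted((x, y)))]


def sp_column(column):
    """C_SP(A_i): sum of sigma over all pairs k < l in one column."""
    return sum(sigma(x, y) for x, y in combinations(column, 2))


def sp_score(alignment):
    """SP(A): sum of column scores over all columns."""
    return sum(sp_column(col) for col in zip(*alignment))


if __name__ == "__main__":
    rows = ["GKN", "TRN", "SHE"]
    print([sp_column(col) for col in zip(*rows)], sp_score(rows))
    n = 10
    correct, wrong = sp_column("L" * n), sp_column("L" * (n - 1) + "G")
    print(correct - wrong, (correct - wrong) / correct)  # 8(N-1) and 4/N
```

Running the drawback part for growing `n` shows the relative penalty for the misaligned glycine shrinking as 4/N.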

• Generalization of pairwise dynamic programming
• 3 sequences: ATGC, AATC, TTGC

x coordinate   0 1 1 2 3 4     A - T G C
y coordinate   0 1 2 3 3 4     A A T - C
z coordinate   0 0 1 2 3 4     - T T G C

• Resulting path
  • (0,0,0) → (1,1,0) → (1,2,1) → (2,3,2) → (3,3,3) → (4,4,4)

Multidimensional dynamic programming (2)
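The correspondence between an alignment and a lattice path shown on the previous slide can be sketched as follows: each column advances the coordinate of every sequence that contributes a residue and leaves the others unchanged. The function name is illustrative.

```python
def alignment_path(rows):
    """Map an alignment (equal-length rows with '-' gaps) to its lattice path."""
    position = [0] * len(rows)
    path = [tuple(position)]
    for column in zip(*rows):
        # A residue consumes one symbol of its sequence; a gap consumes none.
        position = [p + (symbol != "-") for p, symbol in zip(position, column)]
        path.append(tuple(position))
    return path


if __name__ == "__main__":
    print(alignment_path(["A-TGC", "AAT-C", "-TTGC"]))
```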

• Let us assume a linear gap penalty model (not affine)

• $\gamma(g) = g\,d$ for a gap of length $g$ and gap cost $d$

• initialization and backtracking are analogous to the 2D case

Multidimensional dynamic programming (3)
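A sketch of the multidimensional recurrence under stated assumptions: columns are scored by sum of pairs, a residue paired with a gap costs $-d$ (linear gap model), and a gap paired with a gap scores 0. Only the optimal score is computed, without backtracking. The function names and the match/mismatch scoring used in the demo are illustrative, not taken from the slides.

```python
from itertools import combinations, product


def column_score(column, sigma, d):
    """Sum-of-pairs score of one column; residue-gap pairs cost -d."""
    total = 0
    for x, y in combinations(column, 2):
        if x == "-" and y == "-":
            continue
        total += -d if "-" in (x, y) else sigma(x, y)
    return total


def mdp_score(seqs, sigma, d):
    """Optimal SP score of a global alignment of N sequences by N-dim DP."""
    n = len(seqs)
    lengths = [len(s) for s in seqs]
    # Every non-empty subset of sequences may advance: 2^N - 1 predecessors.
    moves = [m for m in product((0, 1), repeat=n) if any(m)]
    best = {(0,) * n: 0}
    # Lexicographic order guarantees predecessors are filled in first.
    for cell in product(*(range(length + 1) for length in lengths)):
        if not any(cell):
            continue
        candidates = []
        for move in moves:
            prev = tuple(c - step for c, step in zip(cell, move))
            if min(prev) < 0:
                continue
            column = [seqs[k][cell[k] - 1] if move[k] else "-" for k in range(n)]
            candidates.append(best[prev] + column_score(column, sigma, d))
        best[cell] = max(candidates)
    return best[tuple(lengths)]


if __name__ == "__main__":
    simple = lambda x, y: 1 if x == y else -1
    print(mdp_score(["ATGC", "AATC", "TTGC"], simple, d=1))
```

The `moves` list makes the per-cell work explicit: 3 candidate predecessors for two sequences, 7 for three.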

• 2 sequences: 3 incoming edges per cell
• 3 sequences: 7 incoming edges per cell

Computational complexity of MDP

• Computing each cell of the DP matrix takes $2^N - 1$ steps (all possible combinations of gaps in a column)

• Let us assume all the sequences have approximately the same length $L$

• Memory complexity $O(L^N)$
• Time complexity $O(2^N L^N)$

MDP - exercise
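The cost model above can be turned into a rough calculator for checking estimates such as those asked for in this exercise. This is only a sketch: it assumes running time is exactly proportional to $(2^N - 1) L^N$ and ignores constant factors and memory limits. The function names are illustrative.

```python
def mdp_cell_updates(n_seqs, length):
    """Number of predecessor evaluations: (2^N - 1) per cell times L^N cells."""
    return (2 ** n_seqs - 1) * length ** n_seqs


def estimated_seconds(n_seqs, length, pair_seconds):
    """Scale a measured pairwise running time to N sequences of equal length."""
    scale = mdp_cell_updates(n_seqs, length) / mdp_cell_updates(2, length)
    return pair_seconds * scale


if __name__ == "__main__":
    for n in range(2, 6):
        print(n, estimated_seconds(n, length=50, pair_seconds=0.1))
```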

• Let us have sequences of length 50
• Comparison of a pair of sequences using DP takes 0.1 s

• What is the time needed to compare 4 sequences?

• Let's say we have 1000 years and the average sequence length is 50.
  • How many sequences can we afford to compare?

Heuristic Algorithms

• Progressive alignment methods
  • iterative building of the alignment
  • Feng & Doolittle
  • ClustalW, Omega
• Consistency-based methods
  • T-Coffee
• Iterative refinement
  • alignment built and then refined by realigning the constituent sequences
  • Barton & Sternberg
• Block-based alignment
  • local alignment built from blocks of ungapped MSA that are identified and assembled
  • DIALIGN
• Mix of approaches
  • MAFFT, MUSCLE

Progressive alignment

• Framework
  • First, two sequences are aligned using standard pairwise alignment
  • The remaining sequences are taken one by one and aligned to the previous ones
  • Repeated until all sequences are aligned

• Parameters
  • The order in which the sequences are aligned
  • Whether only one alignment is kept and sequences are added to it, or whether an alignment can also be aligned to another alignment (as if a tree was being built)
  • The process used to align and score sequences or alignments against the existing ones

Star alignment

• $N$ sequences $s_1, \dots, s_N$ to be aligned

1. Pick $s_i$ as a starting sequence (center)

2. Compute all optimal global alignments between $s_i$ and $s_j$, $j \neq i$

3. Successively merge sequences into the arising MSA
   • "once a gap, always a gap" rule
   • if a gap is introduced into the MSA, it stays there forever

SA – example (1)

Sequences:

S1: ATTGCCATT
S2: ATGGCCATT
S3: ATCCAATTTT
S4: ATCTTCTT
S5: ATTGCCGATT

Pairwise alignments with S1:

ATTGCCATT
ATGGCCATT

ATTGCCATT--
ATC-CAATTTT

ATTGCCATT
ATCTTC-TT

ATTGCC-ATT
ATTGCCGATT

Resulting MSA:

ATTGCC-ATT--
ATGGCC-ATT--
ATTGCCGATT--
ATCTTC--TT--
ATC-CA-ATTTT

credit: Xingquan Zhu, Florida Atlantic University

SA – example (2)

Step 1
  pairwise alignment:
    ATTGCCATT
    ATGGCCATT
  multiple alignment:
    ATTGCCATT
    ATGGCCATT

Step 2
  pairwise alignment:
    ATTGCCATT--
    ATC-CAATTTT
  multiple alignment:
    ATTGCCATT--
    ATGGCCATT--
    ATC-CAATTTT

Step 3
  pairwise alignment:
    ATTGCCATT
    ATCTTC-TT
  multiple alignment:
    ATTGCCATT--
    ATGGCCATT--
    ATC-CAATTTT
    ATCTTC-TT--

Step 4
  pairwise alignment:
    ATTGCC-ATT
    ATTGCCGATT
  multiple alignment:
    ATTGCC-ATT--
    ATGGCC-ATT--
    ATC-CA-ATTTT
    ATCTTC--TT--
    ATTGCCGATT--

SA - choosing the center

• Compute all pairwise alignments and pick the sequence $s_i$ with maximum $\sum_{j \neq i} s(s_i, s_j)$
  • i.e. choose the sequence which is most similar to all the rest

• Alternatively, compute all pairwise alignments, compute an MSA for every $s_i$ as the center, and pick the best

SA – time complexity
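The first strategy for choosing the center can be sketched as below. The scoring scheme (match +1, mismatch -1, linear gap -1) is an assumption made for illustration; the slides do not fix one. `global_score` is a plain Needleman-Wunsch score computation without traceback.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score of a and b (linear gap penalty)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def choose_center(seqs):
    """Index of the sequence maximizing the sum of scores to all others."""
    totals = [
        sum(global_score(s, t) for j, t in enumerate(seqs) if j != i)
        for i, s in enumerate(seqs)
    ]
    return max(range(len(seqs)), key=totals.__getitem__)


if __name__ == "__main__":
    seqs = ["ATTGCCATT", "ATGGCCATT", "ATCCAATTTT", "ATCTTCTT", "ATTGCCGATT"]
    print(choose_center(seqs))
```

Which sequence wins for the slide's five sequences depends on the scoring parameters, so the result of the demo should not be read as the slide's choice of center.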

• Average sequence length $L$

• One global alignment computation in $O(L^2)$

• $k$ sequences → $O(k^2 L^2)$ for all pairwise computations

• $l$ … upper bound on the MSA length → $O(lk)$ for MSA construction

$O(k^2 L^2 + lk) = O(k^2 L^2)$

SA - exercise

• Compute SP for the constructed MSA

• Compute SA for the previous example but add sequences to the MSA in a different order. Does the order of addition impact the score?

• Compute MSA starting with S5. Does the score change?

ATTGCC-ATT
ATTGCCGATT

ATGGCC-ATT
ATTGCCGATT

AT--CCAATTTT
ATTGCCGATT--

AT--CTTCTT
ATTGCCGATT

Feng & Doolittle (1)

1. Calculate a distance matrix from all-to-all pairwise alignments ($N(N-1)/2$)

2. Convert raw alignment scores into (evolutionary) distances
   • $D = -\log S_{eff} \times 100 = -\log \frac{S_{obs} - S_{rand}}{S_{max} - S_{rand}} \times 100$
   • $S_{max}(a, b) = \frac{S(a,a) + S(b,b)}{2}$
   • $S_{rand}$ is an expected score obtained by randomization
   • $S_{eff}$ can be viewed as a normalized percentage similarity which decreases roughly exponentially to 0 with increasing evolutionary distance
   • $-\log$ makes the measure linear with evolutionary distance

3. Construct a guide tree from the distance matrix using the Fitch & Margoliash algorithm
4. Align child nodes of each parent (can be sequence-sequence, sequence-MSA, MSA-MSA) in the order they were added to the tree
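A small sketch of the score-to-distance conversion in step 2. The slide's formula is read here as $(-\log S_{eff}) \times 100$ with a natural logarithm; both readings are assumptions, as is returning infinity when the observed score is no better than the randomized one. The function name and arguments are illustrative.

```python
import math


def feng_doolittle_distance(s_obs, s_aa, s_bb, s_rand):
    """D = -ln(S_eff) * 100, with S_max taken as the mean of the self-scores."""
    s_max = (s_aa + s_bb) / 2
    s_eff = (s_obs - s_rand) / (s_max - s_rand)
    if s_eff <= 0:
        # Assumption: no detectable similarity maps to an infinite distance.
        return math.inf
    return -math.log(s_eff) * 100


if __name__ == "__main__":
    print(feng_doolittle_distance(s_obs=50, s_aa=50, s_bb=50, s_rand=5))
    print(feng_doolittle_distance(s_obs=27.5, s_aa=50, s_bb=50, s_rand=5))
```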

source: Feng, Da-Fei, and Russell F. Doolittle. "Progressive sequence alignment as a prerequisite to correct phylogenetic trees." Journal of molecular evolution 25.4 (1987): 351-360.

Feng & Doolittle (2)

• Sequence-sequence is aligned using classical dynamic programming

• Sequence-MSA – sequence is aligned with each sequence in the group and the highest scoring alignment defines how the sequence is added to the group

• MSA-MSA – as in previous case but all pairs of sequences are tested

• When a sequence is added to a group, a neutral symbol X is introduced in place of the gap positions
  • allows gap positions to be aligned
  • neutral: anything aligned with X scores 0
  • side effect: the gaps in two MSAs tend to come together in the resulting MSA

Profile/MSA Alignment

• When adding a sequence to a group it is desirable to take into account the MSA built so far
  • mismatches at highly conserved positions should be penalized more
• 2 MSAs (profiles) of $N$ sequences, the first with sequences $1..n$, the second with sequences $n+1..N$

$\sum_i S(A_i) = \sum_i \sum_{k<l\le N} \sigma(A_i[k], A_i[l])$

$= \sum_i \sum_{k<l\le n} \sigma(A_i[k], A_i[l]) + \sum_i \sum_{n<k<l\le N} \sigma(A_i[k], A_i[l]) + \sum_i \sum_{k\le n,\, n<l\le N} \sigma(A_i[k], A_i[l])$

• The score $\sum_i \sum_{k<l\le N} \sigma(A_i[k], A_i[l])$ consists of the in-group scores plus the between-group scores
• when aligning the profiles we can use standard dynamic programming where columns are aligned against columns using the between-group scores
• → using position-specific information from the group's multiple alignment

ClustalW
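The decomposition of the sum-of-pairs score into in-group and between-group parts, used for profile alignment on the previous slide, can be checked numerically for a single column. The toy scoring function is an assumption for illustration only; any symmetric $\sigma$ would do.

```python
from itertools import combinations


def sp_column(column, sigma):
    """Sum of sigma over all pairs within one column."""
    return sum(sigma(x, y) for x, y in combinations(column, 2))


def between_groups(column_1, column_2, sigma):
    """Sum of sigma over all pairs with one symbol from each group's column."""
    return sum(sigma(x, y) for x in column_1 for y in column_2)


if __name__ == "__main__":
    toy_sigma = lambda x, y: 2 if x == y else -1   # illustrative scores
    group_1, group_2 = "LLG", "LG"                 # one column from each profile
    whole = sp_column(group_1 + group_2, toy_sigma)
    parts = (sp_column(group_1, toy_sigma)
             + sp_column(group_2, toy_sigma)
             + between_groups(group_1, group_2, toy_sigma))
    print(whole, parts)
```

Since the two in-group terms do not depend on how the profiles are placed against each other, only `between_groups` needs to be evaluated inside the column-against-column dynamic programming.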

• Similar to Feng & Doolittle but uses profile-based building

1. Calculate a distance matrix from all-to-all pairwise alignments ($N(N-1)/2$)

2. Convert raw alignment similarity scores into evolutionary distances

3. Construct a guide tree from the distance matrix using the neighbor-joining algorithm

4. Progressively align the nodes in order of decreasing similarity, using sequence-sequence, sequence-profile, and profile-profile alignment

ClustalW - Alignment

column-based global alignment

standard global alignment

ClustalW – heuristics (1)

• Weighting of subsequences to compensate for biased representation in large subfamilies (compensates for defects of sum-of-pairs scoring)
  • sequence contributions to the MSA are weighted by their relationships in the predicted evolutionary tree

• Substitution matrix chosen by divergence: BLOSUM 80 for closely related sequences, BLOSUM 50 for distant sequences

• Position-specific gap open penalties in profiles are multiplied by a modifier that is a function of the residues observed at the position
  • comes from structure-based alignments

ClustalW – heuristics (2)

• Gap penalties are higher if there are no gaps at a given position but some exist at nearby positions
  • forces the gaps to be in the same places

• Guide tree can be adjusted on the fly to postpone aligning low-similarity sequences up to the point where more information is present

• …

Clustal Omega

• ClustalW does not scale well to large numbers (thousands) of sequences
  • Bottleneck is the guide tree construction → $O(M \times N^2)$
• Clustal Omega algorithm
  • mBed algorithm for guide tree construction
    • emBedding of each sequence in a space of $n$ dimensions, where $n$ is proportional to $\log N$
    • Each sequence is replaced by an $n$-element vector, where each element is the distance to one of $n$ reference sequences
    • Clustering of the vectors by UPGMA or K-means
  • Alignment of profiles using hidden Markov models (HHalign package)

• Additional features
  • Adding sequences to existing alignments
  • External profiles: HMM profile from sequences homologous to the input set, which can be used in MSA construction
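The embedding idea behind mBed can be illustrated with a toy sketch: each sequence becomes a vector of distances to a few reference sequences, so only $N \times n$ distances are needed instead of all $N^2$ pairs. The k-mer set distance and the way references are supplied are assumptions for illustration and are not the actual mBed procedure.

```python
def kmer_distance(a, b, k=2):
    """Toy distance: 1 - shared distinct k-mers / smaller k-mer set size."""
    kmers_a = {a[i:i + k] for i in range(len(a) - k + 1)}
    kmers_b = {b[i:i + k] for i in range(len(b) - k + 1)}
    if not kmers_a or not kmers_b:
        return 1.0
    return 1 - len(kmers_a & kmers_b) / min(len(kmers_a), len(kmers_b))


def embed(seqs, references, distance=kmer_distance):
    """Replace every sequence by its vector of distances to the references."""
    return [[distance(s, r) for r in references] for s in seqs]


if __name__ == "__main__":
    seqs = ["AAAA", "AAAT", "TTTT"]
    references = ["AAAA", "TTTT"]   # hypothetical choice of n = 2 references
    for seq, vector in zip(seqs, embed(seqs, references)):
        print(seq, vector)
```

The resulting vectors can then be clustered (the slide names UPGMA or K-means) without computing any further sequence-to-sequence distances.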

Progressive alignment drawbacks

• The drawback of progressive methods is the greedy nature of the algorithm
  • once an error, always an error

• Solutions
  • Iterative approach
    • corrects mistakes in the initial alignment, which can happen easily if pairwise sequence similarity is too low
  • Consistency-based approach
    • tries to avoid mistakes in advance

T-Coffee

• A progressive alignment with the ability to consider information from all of the sequences during each alignment step, not just those being aligned at that stage (consistency)

• Tackles ClustalW drawback

T-Coffee algorithm (1)

1. Primary library generation
   • Generate primary libraries of sequence alignments (using several methods)
   • For each pair compute weights based on percent identity (shorter sequence)
   • Merge the primary libraries: pairs of aligned residues get a weight equal to the sum of the weights
2. Extended library generation
   • For each pair of sequences try to align them using an intermediate sequence
   • Score each pair of residues by the lower of the two weights and sum the weights for that residue pair over all triplets
3. Progressive alignment
   • A guide tree is used to build the multiple alignment
   • A pair of sequences is aligned using dynamic programming with weights based on the extended library
   • When aligning two sets of sequences, averaged library weights are used

T-Coffee algorithm (2)
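Library extension (step 2 above) can be sketched as follows. A residue is modelled as a `(sequence name, position)` tuple and the primary library as a dictionary from residue pairs to weights; both are hypothetical data structures chosen for this sketch. The weights in the demo are illustrative values picked to reproduce the arithmetic 77 + 88 = 165, not entries of a real library.

```python
from collections import defaultdict
from itertools import combinations


def extend_library(primary):
    """Add, for every residue pair, min-weights through each intermediate residue."""
    neighbours = defaultdict(dict)
    for (a, b), weight in primary.items():
        neighbours[a][b] = weight
        neighbours[b][a] = weight
    extended = defaultdict(float)
    for (a, b), weight in primary.items():
        extended[tuple(sorted((a, b)))] += weight       # direct contribution
    for intermediate, linked in neighbours.items():
        for a, b in combinations(sorted(linked), 2):
            if a[0] != b[0]:                            # residues of two different sequences
                extended[(a, b)] += min(linked[a], linked[b])
    return dict(extended)


if __name__ == "__main__":
    primary = {
        (("A", 1), ("B", 1)): 88,
        (("A", 1), ("C", 1)): 77,
        (("C", 1), ("B", 1)): 100,
    }
    print(extend_library(primary)[(("A", 1), ("B", 1))])
```

The sketch assumes every primary entry links residues of two different sequences, so an intermediate residue always belongs to a third sequence.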

(figure: library extension example, weight 77 + 88 = 165)

Iterative refinement

• Once an error, always an error
  • when a sequence is added to an MSA it cannot be changed later on (holds also for consistency-based approaches)
  • dependence on the initial alignment

• Idea behind iterative refinement methods
  • an optimal solution can be obtained by iterative improvement of suboptimal solutions

• First, a suboptimal solution is identified by a fast heuristic method and then possibly improved by iterative removal and re-addition of the sequences

Barton-Sternberg

1. Find the highest scoring pair of sequences and align them using pairwise dynamic programming
2. Identify the most similar remaining sequence with respect to the existing profile and align it using sequence-profile alignment
3. Repeat step 2 until all sequences are aligned
4. Remove sequence $s_1$ and realign it to the profile by profile-sequence alignment. Repeat for $s_2, \dots, s_N$
5. Repeat step 4 a fixed number of times or until convergence

Block-based alignment

• Both progressive and iterative methods assume that all parts of all sequences are consistent with a global alignment
  • not every position in the alignment has to correspond to a homologous site in all sequences

• Block-based methods approach the problem of global alignment by
  • splitting sequences into blocks
  • aligning the blocks
  • merging block alignments

DIALIGN

• Alignment constructed from gap-free local alignments between pairs of sequences
  • based on diagonals in the dynamic programming matrix
• Procedure
  • align all possible pairs of sequences
  • determine all diagonals and assign weights
    • a diagonal $D$ of length $l$ is assigned a score $s$
    • determine the length-independent weight as $w(D) = -\log P(l, s)$, where $P(l, s)$ is the probability that a diagonal of length $l$ has a score of at least $s$
  • build the MSA by adding consistent diagonals in order of decreasing weight (and overlap with other diagonals)
  • explore unaligned regions and include them if possible

DIALIGN - Example

atctaatagttaaactcccccgtgcttag

cagtgcgtgtattactaacggttcaatcgcg

caaagagtatcacccctgaattgaataa

atc------taatagttaaactcccccgtgcttag
cagtgcgtgtattactaacggttcaatcgcg
caaagagtatcacccctgaattgaataa

atc------taatagttaaactcccccgtgcttag
cagtgcgtgtattactaacggttcaatcgcg
caaa--gagtatcacccctgaattgaataa

atc------taatagttaaactcccccgtgcttag
cagtgcgtgtattactaacggttcaatcgcg
caaa--gagtatcacc------cctgaattgaataa

atc------taatagttaaactcccccgtgc-ttag
cagtgcgtgtattactaac------gg-ttcaatcgcg
caaa--gagtatcacc------cctgaattgaataa

atc------TAATAGTTAaactccccCGTGC-TTag
cagtgcGTGTATTACTAAc------GG-TTCAATcgcg
caaa--GAGTATCAcc------CCTGaaTTGAATaa

MAFFT

• Multiple Alignment using Fast Fourier Transform
• 2-cycle progressive method + iterative refinement
  • Fast low-quality all pairwise distances
  • Tentative MSA construction
  • Refinement of distances based on MSA
  • Second progressive alignment stage
  • Iterative refinement
• Strategies (figure labels): FFT-NS-1 (first progressive cycle), FFT-NS-2 (adds the second progressive cycle), FFT-NS-i (adds iterative refinement)
• Reducing CPU time
  • 6-mer method for pairwise alignments
  • FFT-based group-to-group alignment algorithm

MAFFT – guide tree construction

• Sequences and groups are progressively aligned using FFT-based alignment (next slides) in the order given by a guide tree

• To quickly obtain the all-to-all distance matrix
  • 20 AAs are grouped into 6 physico-chemical groups
  • The number of shared 6-tuples $T_{ij}$ is computed and turned into $D_{ij}$
  • $D_{ij} = 1 - [T_{ij} / \min(T_{ii}, T_{jj})]$

• UPGMA method is used to obtain the guide tree from $D_{ij}$
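A sketch of the 6-tuple distance described above. The slides do not list the six physico-chemical groups, so the grouping below is a hypothetical one chosen for illustration, and "shared 6-tuples" is interpreted as the multiset intersection of the tuple counts. Sequences are assumed to contain only the 20 standard amino acids and to be at least `k` residues long.

```python
from collections import Counter

# Hypothetical 6-group reduced alphabet (assumption, not from the slides).
GROUPS = ["AGPST", "C", "DENQ", "FWY", "HKR", "ILMV"]
TO_GROUP = {aa: str(index) for index, group in enumerate(GROUPS) for aa in group}


def tuple_counts(seq, k=6):
    """Count k-tuples of the sequence after mapping it to the reduced alphabet."""
    reduced = "".join(TO_GROUP[aa] for aa in seq)
    return Counter(reduced[i:i + k] for i in range(len(reduced) - k + 1))


def shared_tuples(counts_i, counts_j):
    """T_ij: size of the multiset intersection of two tuple counts."""
    return sum((counts_i & counts_j).values())


def tuple_distance(seq_i, seq_j, k=6):
    """D_ij = 1 - T_ij / min(T_ii, T_jj)."""
    ci, cj = tuple_counts(seq_i, k), tuple_counts(seq_j, k)
    return 1 - shared_tuples(ci, cj) / min(shared_tuples(ci, ci), shared_tuples(cj, cj))


if __name__ == "__main__":
    print(tuple_distance("ACDEFGHIKLMNPQRSTVWY", "ACDEFGHIKLMNPQRSTVWY"))
    print(tuple_distance("AGPSTAG", "AAAAAAA"))   # identical after reduction
    print(tuple_distance("AAAAAAA", "CCCCCCC"))
```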

MAFFT – FFT-based group-to-group alignment (1)

• AA sequence converted to a sequence of vectors (signals) of normalized volume ($\hat{v}(a) = [v(a) - \bar{v}]/\sigma_v$) and polarity ($\hat{p}(a) = [p(a) - \bar{p}]/\sigma_p$)
• Correlation between 2 AA sequences is assessed using
  • $c(k) = c_v(k) + c_p(k)$
  • $c_x(k) = \sum_{1 \le n \le N,\, 1 \le n+k \le M} \hat{x}_1(n)\,\hat{x}_2(n+k)$, $x \in \{v, p\}$
  • $\hat{x}_i(j)$ … $x$ component ($v$ or $p$) of the $j$-th site in sequence $i$
• If $X_i(m)$ are the FFT of $\hat{x}_i(n)$, then, by the cross-correlation theorem, $c_x(k) \Leftrightarrow X_1^*(m)\,X_2(m)$
• FFT reduces the $O(n^2)$ time to $O(n \log n)$

• Correlation has high peaks when the two compared sequences have high similarity regions offset by the lags
  • a sliding window of a given size is used to reveal homologous regions (sequence identity in the window is measured)
  • Successive homologous segments are combined
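The cross-correlation theorem used above can be checked with a small, dependency-free sketch. The signals stand in for one normalized component (volume or polarity) of two sequences; the numeric values in the demo are made up. Indices are 0-based here, so lags run from $-(N-1)$ to $M-1$. The radix-2 FFT is a textbook implementation written for this example, not MAFFT's code.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 FFT; len(x) must be a power of two (unnormalized)."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2], inverse), fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for m in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * m / n) * odd[m]
        out[m], out[m + n // 2] = even[m] + twiddle, even[m] - twiddle
    return out


def correlation_direct(x1, x2):
    """c(k) = sum_n x1[n] * x2[n + k], computed in O(N * M)."""
    return {k: sum(x1[n] * x2[n + k] for n in range(len(x1)) if 0 <= n + k < len(x2))
            for k in range(-(len(x1) - 1), len(x2))}


def correlation_fft(x1, x2):
    """Same c(k) via inverse FFT of conj(X1) * X2 on zero-padded signals."""
    size = 1
    while size < len(x1) + len(x2):
        size *= 2
    spec1 = fft([complex(v) for v in x1] + [0j] * (size - len(x1)))
    spec2 = fft([complex(v) for v in x2] + [0j] * (size - len(x2)))
    product = [a.conjugate() * b for a, b in zip(spec1, spec2)]
    c = [value.real / size for value in fft(product, inverse=True)]
    # Negative lags wrap around to the end of the circular result.
    return {k: c[k % size] for k in range(-(len(x1) - 1), len(x2))}


if __name__ == "__main__":
    signal_1 = [0.5, -1.2, 0.3, 1.1]
    signal_2 = [0.0, 0.5, -1.2, 0.3, 1.1, -0.4]   # signal_1 shifted by one site
    c = correlation_fft(signal_1, signal_2)
    print(max(c, key=c.get))                      # lag with the highest peak
```

The peak lag is what the sliding-window step then inspects for homologous segments.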

MAFFT – FFT-based group-to-group alignment (2)

• Matrix 푺풊풋 (1 ≤ 푖, 푗 ≤ 푛) is constructed with values corresponding to the scores of the 풏 identified homologous segments (0 if (푖, 푗) is not homologous pair) • Optimal path through 푺 is identified • DP matrix of the sequences is then computed, where the optimal path must go through the centers of the segments of optimal path in 푆, thus restricting the number of elements which need to be visited

MAFFT – FFT-based group-to-group alignment (3)

• Group-to-group alignment is an extension of the approach that treats each group as a linear combination of the volume/polarity components of its sequences
  • $\hat{x}_{group_i}(n) = \sum_{j \in group_i} w_j\,\hat{x}_j(n)$, $x \in \{v, p\}$, $i \in \{1, 2\}$
  • $w_j$ is the weighting factor for sequence $j$, calculated as in CLUSTALW in the progressive stage, and as in (*) in the iterative stage

• For nucleotide sequences, the 2D vectors consisting of polarity/volume components are replaced by 4D vectors of A, C, G, T components

(*) Gotoh, O. (1995). A weighting system and algorithm for aligning many phylogenetically related sequences. Bioinformatics, 11(5), 543-551.

MAFFT – iterative refinement

• Alignment divided into two groups and realigned
  • Tree-dependent restricted partitioning (*)
    • a technique to efficiently reduce the execution time of iterative algorithms

• Repeated until no better alignment is obtained

(*) Hirosawa, M., Totoki, Y., Hoshida, M., & Ishikawa, M. (1995). Comprehensive study on iterative algorithms of multiple sequence alignment. Bioinformatics, 11(1), 13-18.

MAFFT – improvements

• Version 5
  • Consistency-based scoring: new objective function to reveal distant homologues, applied to the iterative refinement stage
    • T-Coffee-like approach of incorporating all pairwise alignment information into the objective function
    • Computed from all-to-all pairwise alignments before constructing the MSA
    • Summation of weighted sum-of-pairs score
  • Dropped re-construction phase of the guide tree
• Version 6
  • New tree-building algorithm, PartTree, for handling larger numbers of sequences
  • multiple ncRNA alignment framework incorporating structural information

MUSCLE

• MUltiple Sequence Comparison by Log-Expectation
• Stages
  1. Draft progressive
  2. Improved progressive
  3. Refinement

MUSCLE details

• k-mer distance
  • Fraction of common k-mers in a pair of sequences
  • Possibly on a compressed alphabet (similar residues, e.g. hydrophobic, get the same letter)
  • Approximates well the fraction of common residues in a global alignment
• Kimura distance (correction)
  • Computed from the fractional identity of the aligned sequences; the fraction of differing positions $D$ is a good approximation of the distance for closely related sequences
  • exact if positions are allowed to mutate only once → multiple mutations at a single site (more distant sequences) require a correction
  • $d_{kimura} = -\log_e\left(1 - D - \frac{D^2}{5}\right)$
• Stage 3 refinement
  • Choose an edge from TREE2 (going from the leaves to the root) and delete it
  • Build profiles for the MSAs of the two resulting subtrees, re-align them and accept the change if it leads to an improvement
  • Iterate until convergence
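The Kimura correction above is a one-line formula, sketched here with $D$ taken as the fraction of differing aligned positions (1 minus fractional identity), which is the reading under which identical sequences get distance 0. Returning infinity once the logarithm's argument becomes non-positive is an assumption of this sketch, not part of the formula.

```python
import math


def kimura_distance(fraction_different):
    """d = -ln(1 - D - D^2 / 5), with D the fraction of differing positions."""
    argument = 1 - fraction_different - fraction_different ** 2 / 5
    if argument <= 0:
        # Assumption: saturated pairs are reported as infinitely distant.
        return math.inf
    return -math.log(argument)


if __name__ == "__main__":
    for d in (0.0, 0.1, 0.3, 0.6):
        print(d, kimura_distance(d))
```

For small $D$ the corrected distance stays close to $D$, and it grows faster than $D$ as sequences diverge, reflecting multiple substitutions at the same site.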

Comparison of MSA algorithms - benchmark

• Benchmarking → guidelines
• A commonly used dataset for benchmarking MSA algorithms is BAliBASE
  • high quality manually refined reference alignments based on 3D structural superpositions

BAliBASE reference sets:
• cases with small numbers of equidistant sequences, further subdivided by percent identity
• families with one or more “orphan” sequences
• divergent subfamilies, with less than 25% identity between the groups
• sequences with large terminal extensions
• sequences with large internal insertions and deletions
• sequences with repeats, transmembrane regions, and inverted domains, respectively
• families with linear motifs often found in disordered regions that are difficult to align
• sequences with subfamily specific features, motifs in disordered regions and fragmentary/erroneous sequences

Comparison of main algorithms

SP: sum of pairs
TC: based on the number of columns aligned 100% correctly

source: Pais, F. S. M., de Cássia Ruy, P., Oliveira, G., & Coimbra, R. S. (2014). Assessing the efficiency of multiple sequence alignment programs. Algorithms for Molecular Biology, 9(1), 4.

Guidelines - accuracy

• No MSA program outperformed all others in all test cases

• For the R1-5 sets, T-Coffee, Probcons, MAFFT and Probalign were superior with regard to alignment accuracy
• When aligning the available short versions of the sequences (BBS), Probcons and T-Coffee outperformed Probalign and MAFFT
  • statistically significant superiority of Probcons and T-Coffee in comparison to Probalign and MAFFT in R1 & R2
• When aligning the full-length versions (BB) of R1, R3 and R5, which represent more difficult test cases, and also R4, where large terminal extensions are present, Probalign, MAFFT and, surprisingly, CLUSTAL OMEGA generally outperformed both Probcons and T-Coffee
  • T-Coffee and Probcons worked well when aligning truncated sequences but did less well on datasets with long N/C-terminal ends, due to the presence of non-conserved residues at the terminal ends
  • MAFFT, Probalign and even CLUSTAL OMEGA may be preferred over T-Coffee and Probcons when aligning sequences with these long terminal extensions
• Contradictory performance of CLUSTAL OMEGA: performed well in the three reference sets R3-5 with full-length sequences but not with the short versions
• CLUSTALW, DIALIGN-TX and POA had Z-scores below the average in almost all test cases from the first five reference sets

Guidelines – computational costs

• Speed
  • CLUSTALW and MUSCLE were the fastest of the evaluated programs
  • T-Coffee and MAFFT deliver fast alignments in a multi-core environment
  • Probcons and Probalign exceeded the 2.5 hours cutoff in the last three subsets from Reference 9

• Memory
  • CLUSTALW consumed the least memory
  • T-Coffee running in single-core mode generally consumed more RAM than the others and was also the slowest

Phylo DNA Puzzles