International Journal of Advanced Science and Technology Vol. 29, No. 3, (2020), pp. 11251 - 11265

A Comprehensive Review and Performance Evaluation of Algorithms for DNA Sequences

Neelofar Sohi1 and Amardeep Singh2
1Assistant Professor, Department of Computer Science & Engineering, Punjabi University Patiala, Punjab, India
2Professor, Department of Computer Science & Engineering, Punjabi University Patiala, Punjab, India

Abstract

Background: Sequence alignment is a very important step for high-level sequence analysis applications in the area of biocomputing. Alignment of DNA sequences helps in finding the origin of sequences, homology between sequences, constructing phylogenetic trees depicting evolutionary relationships and other tasks. It helps in identifying genetic variations in DNA sequences which might lead to diseases. Objectives: This paper presents a comprehensive review of sequence alignment approaches, methods and various state-of-the-art algorithms. Performance evaluation and comparison of a few algorithms and tools is performed. Methods and Results: In this study, various tools and algorithms are studied and implemented, and their performance is evaluated and compared using Identity Percentage as the main metric. Conclusion: It is observed that for pairwise sequence alignment, Clustal Omega Emboss Matcher outperforms other tools and algorithms, followed by Clustal Omega Emboss Water, further followed by Blast (Needleman-Wunsch algorithm for global alignment).

Keywords: sequence alignment; DNA sequences; progressive alignment; iterative alignment; natural computing approach; identity percentage.

1. Brief History

Sequence alignment is a very important area in the field of biocomputing. Sequence alignment aims to identify the regions of similarity between two or more sequences. Alignment is also termed ‘mapping’; it is done to identify and compare the nucleotide bases, i.e. A, G, C and T, in DNA sequences (nucleotide bases in DNA or RNA sequences and amino acids in proteins).

Sequence alignment acts as an important step in solving problems like finding homology, determining the origin of a sequence, protein structure prediction, identifying new members of a protein family, constructing evolutionary (phylogenetic) trees, identifying mutations and further higher-level sequence analyses [1-2].

Sequence analyses provide evidence that 99% of the genome sequences of different individuals are identical [3]; the difference of 1% is due to genetic variations, which lead to human inherited diseases. Single Nucleotide Polymorphisms (SNPs), insertions/deletions, block substitutions, inversions, variable number of tandem repeat sequences (VNTRs) and copy number variations (CNVs) are a few common genetic variations [4].

Sequence alignment is a pre-processing step for detection and identification of these genetic variations.

There are two categories of sequence alignment viz. pairwise sequence alignment (PSA) and multiple sequence alignment (MSA). Pairwise sequence alignment involves comparison of two sequences to identify the matching regions. MSA involves comparison of more than two sequences, where there can be ‘n’ query sequences to be compared against ‘n’ reference sequences. Before aligning the sequences, sequences of unequal length need to be made equal in length by inserting gaps. A large number of gap insertion algorithms are available. Optimality of a gap insertion algorithm relies on maximisation of the number of matches.

Next generation technologies like Roche/454 (454 Life Sciences, 2013), Illumina (Illumina, 2013), Solid (SolidTM4 System, 2013) and large scale projects like the Human Genome Project (Human Genome Project Information, 2013), 1000 Genomes Project (Home, 1000 Genomes, 2013) and Genome 10K project (G.K.C.O Scientists, 2009) are generating large volumes of sequence data. Alignment of multiple sequences of huge lengths becomes a big challenge. As many high-level applications depend upon alignment of sequences, it becomes highly important to produce alignments with high accuracy, high speed, good quality and low computational complexity [5-7].

For early MSA tools, time complexity is O(L^n), where L is the length of a sequence and ‘n’ is the number of sequences. Earlier, ‘L’ used to be large and ‘n’ used to be smaller than ‘L’, whereas in the current situation ‘n’, the number of sequences, has become larger than ‘L’. Clustal Omega is reported to be the only MSA algorithm which can handle up to 190,000 sequences, aligning them in a few hours on a single processor [8]. In a scalability study conducted in 2013, Sievers et al. compared 18 standard automated MSA tools and packages. The study concluded that tools like PSAlign, Prank, FSA and Mummals can align up to 100 sequences. Tools like Probcons, MUSCLE, MAFFT, ClustalW and MSAProbs can align up to 1000 sequences. Tools such as Clustal Omega, Kalign and Part-Tree can align up to 50,000 sequences. Some MSA methods produce high quality output but do not scale well to thousands of sequences, whereas a few provide good scalability but produce poor quality output [8]. Various researchers have reviewed the pairwise and multiple sequence alignment methods and approaches [9-12]. Comparison as well as evaluation of alignment methods has been done in certain studies [8], [13-16].

There are two types of sequence alignment viz. local and global alignment. Local alignment is done to find local regions of high similarity between two sequences. It is best suited for quite divergent sequences having local regions of high similarity. Global alignment performs end-to-end alignment by comparing the full lengths of two sequences against each other. Local alignment methods that report ungapped segments avoid the problem of fixing a gap penalty. Global alignments provide information for evolutionary comparisons and local alignments are useful for structural predictions [17].

The paper is structured as follows: Section 2 describes major approaches, methods and state-of-the-art algorithms & tools used for sequence alignment of DNA sequences. Section 3 presents performance evaluation including performance metrics and results of evaluation for various sequence alignment algorithms & tools. Section 4 presents conclusion drawn from the study.


2. Sequence Alignment Approaches

2.1 Dynamic Programming

Dynamic Programming is the approach that produces optimal alignment. Needleman-Wunsch algorithm is a DP based technique which produces global alignment and Smith-Waterman algorithm is another DP based technique which produces local alignment. Here, for obtaining MSA, we try to maximize the Sum of Pairs score obtained from the pairwise alignments of sequences.
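The Sum of Pairs objective mentioned above can be illustrated with a minimal sketch. The scoring values (match = 1, mismatch = -1, gap = -2, gap-gap = 0) and the function names are assumptions chosen for illustration; they are not taken from any of the reviewed tools.

```python
from itertools import combinations

def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score one pair of symbols from an alignment column ('-' is a gap)."""
    if a == "-" and b == "-":
        return 0  # assumption: gap-gap pairs contribute nothing
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch

def sum_of_pairs(alignment, **scores):
    """Add pair_score over every column and every pair of rows."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all aligned rows must have the same length")
    total = 0
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            total += pair_score(a, b, **scores)
    return total

# Three aligned rows of a toy MSA.
msa = ["ACGT-", "AC-TA", "ACGTA"]
sp = sum_of_pairs(msa)
print(sp)
```

An MSA method that optimises this objective would prefer, among candidate alignments of the same sequences, the one with the larger value of `sum_of_pairs`.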

There is no universally accepted objective function for MSA using the DP approach. The time complexity of DP is O(L^n), where L is the length of a sequence and ‘n’ is the number of sequences; for pairwise alignment (n = 2) this is O(L^2). DP produces an optimal alignment for a pair of sequences, but time complexity increases exponentially for multiple sequences. DP involves the following steps [17]:

⚫ Every nucleotide in one sequence is compared to each and every nucleotide of the second sequence.

⚫ Results of this comparison are marked and stored in an m×n matrix, where m and n are the lengths of the two sequences.

⚫ All paths in the matrix are searched to find the optimal alignment with highest score.
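The three steps above can be sketched for the global (Needleman-Wunsch) case as follows. This is a minimal illustration under assumed scores (match = 1, mismatch = -1, linear gap = -1); the function name and defaults are hypothetical and the sketch makes no attempt at the optimisations used by production tools.

```python
def needleman_wunsch(s, t, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_s, aligned_t) for a global alignment of s and t."""
    m, n = len(s), len(t)
    # score[i][j] holds the best score for aligning s[:i] with t[:j].
    score = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        score[i][0] = i * gap
    for j in range(1, n + 1):
        score[0][j] = j * gap

    def sub(i, j):
        return match if s[i - 1] == t[j - 1] else mismatch

    # Fill the matrix: every base of s is compared with every base of t.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right cell to recover one optimal path.
    a, b = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return score[m][n], "".join(reversed(a)), "".join(reversed(b))

print(needleman_wunsch("ACGT", "AGT"))
```

Filling the matrix costs time proportional to m×n, which is the quadratic pairwise cost; the trace-back itself only walks a single path of length at most m + n.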

DP provides the optimum alignment for a given objective function for the pairwise sequence alignment problem through a trace-back procedure, whereas this trace-back procedure takes exponential time for MSA [18]. MSA is an NP-complete problem where the aim is to identify the MSA with maximum score among the set of alignments found. Hence, for MSA, more sophisticated heuristic methods are required [19]. Agarwal et al. (2005) proposed a more efficient version of DP which produces an optimal alignment. Bayat et al. (2019) proposed a DP based method which produces semi-global alignment, where a few of the first and last bases of the compared sequences can be skipped. This method extracts ‘Maximal Exact Matches’ (MEMs) from the compared sequences using shift and compare operations on the two sequences. This method is suitable where the number of MEMs is lower than the total number of bases (or nucleotides) in the sequences.

2.2 Heuristic Approach

Heuristic methods do not guarantee optimal solutions, but they provide a feasible solution in a short amount of time, in contrast to the DP based approach, i.e. exact alignment approaches, which provide high quality, optimal or near optimal results [18].

2.2.1 Pairwise sequence alignment

For pairwise sequence alignment, BLAST, BLAT and FASTA are the popular tools based on the heuristic approach; they produce solutions in a short amount of time.

2.2.1.1 BLAT: BLAT [20] stands for BLAST-Like Alignment Tool. BLAT was written by Jim Kent. Its working principle is similar to that of BLAST, with the improvement that it stores an index of the reference sequence in memory rather than the full sequence, leading to low memory requirement and enhanced speed of alignment. The index is used to find areas of homology, which can then be loaded into memory for a detailed alignment. BLAT is meant for finding sequences with high similarity from the same or closely related species. BLAT is available as a standalone tool and as a web application, and it is integrated with the UCSC Genome Browser [21]. The steps involved in working of BLAT are discussed below:


⚫ Break query sequence into query words. L-w+1 words are formed where L is length of sequence and w is word size (w=3 by default)

⚫ Break reference sequence into words

⚫ Compare the word list with database and find exact matches

⚫ Extend the match to neighbouring regions

⚫ Find High Scoring Segment Pairs (HSPs)

2.2.1.2 BLAST: BLAST stands for Basic Local Alignment Search Tool. It was developed in 1990 by S.F. Altschul, W. Gish, W. Miller, E.W. Myers, D.J. Lipman and NCBI [22]. It produces local alignments and serves DNA as well as protein sequence alignment. It can compare one query sequence to a database of sequences. BLAST enables species identification, locating domains, DNA mapping and annotation. BLAST has higher speed than the FASTA program [23]. Different variants of BLAST are listed in Table 1:

Table 1. Variants of BLAST

Program    Query sequence            Reference sequence
BLASTP     Protein                   Protein
BLASTN     Nucleotide                Nucleotide
BLASTX     Nucleotide (translated)   Protein
TBLASTN    Protein                   Nucleotide (translated)
TBLASTX    Nucleotide (translated)   Nucleotide (translated)

The steps involved in working of BLAST are discussed below:

⚫ Break query sequence into query words. L-w+1 words are formed where L is length of sequence and w is word size (w=3 by default)

⚫ Break reference sequence into words

⚫ Compare the word list with database and find exact matches

⚫ Extend the match to neighbouring regions

⚫ Find High Scoring Segment Pairs (HSPs)
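The word-based seeding and extension steps listed above can be sketched as follows. This is a simplified illustration, not the actual BLAST or BLAT implementation: extension here simply continues while bases match exactly, whereas real tools extend with a scoring matrix and a drop-off threshold. All function names are hypothetical.

```python
from collections import defaultdict

def words(seq, w):
    """Return the L - w + 1 overlapping words of length w with their offsets."""
    return [(seq[i:i + w], i) for i in range(len(seq) - w + 1)]

def exact_word_hits(query, reference, w=3):
    """Index the reference words, then look up every query word."""
    index = defaultdict(list)
    for word, pos in words(reference, w):
        index[word].append(pos)
    return [(qpos, rpos)
            for word, qpos in words(query, w)
            for rpos in index.get(word, [])]

def extend_hit(query, reference, qpos, rpos, w=3):
    """Ungapped extension of a seed in both directions while bases match."""
    left = 0
    while (qpos - left - 1 >= 0 and rpos - left - 1 >= 0
           and query[qpos - left - 1] == reference[rpos - left - 1]):
        left += 1
    right = w
    while (qpos + right < len(query) and rpos + right < len(reference)
           and query[qpos + right] == reference[rpos + right]):
        right += 1
    return query[qpos - left:qpos + right]

hits = exact_word_hits("ACGT", "TTACGTT")
print(hits, extend_hit("ACGT", "TTACGTT", *hits[0]))
```

The extended segments play the role of candidate High Scoring Segment Pairs; a fuller sketch would score each one and keep only those above a threshold.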

2.2.1.3 FASTA: FASTA was developed in 1995. This is an improved version of FASTP developed by D.J. Lipman and W.R. Pearson in 1985 [24-25]. FASTP was used for protein sequences only whereas FASTA is suitable for DNA versus DNA, translated protein versus DNA and for evaluating statistical significance. TFASTAX, TFASTAY, FASTAX and FASTAY are the various programs of FASTA. FASTA is found to be more accurate (in terms of sensitivity) than BLAST [23]. FASTA is derived from concept of dot plot. It computes best diagonals from all frames of alignment. It looks for exact matches between words in query sequence and reference sequence.


The steps involved in working of FASTA are discussed below:

⚫ Identify common k-words between I and J, where I denotes the query sequence and J denotes the reference sequence (words for DNA are 6 nucleotides long, like AGTCCA, and words for proteins are 2 residues long)

⚫ Score diagonals with k-word matches, identify 10 best diagonals. High scoring diagonals are selected where offset is given by i-j

⚫ Rescore initial regions with a substitution score matrix like PAM

⚫ Join initial regions using gaps; then penalise the gaps

⚫ Perform Dynamic Programming to find final alignments
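The first two steps above (finding common k-words and scoring diagonals by the offset i-j) can be sketched as below. Only the counting of k-word matches per diagonal is shown; rescoring with a substitution matrix, joining regions and the final DP step are omitted. The function name and the `top` parameter are assumptions for illustration.

```python
from collections import Counter, defaultdict

def best_diagonals(query, reference, k=6, top=10):
    """Count shared k-words per diagonal (offset i - j) and return the best ones."""
    positions = defaultdict(list)
    for j in range(len(reference) - k + 1):
        positions[reference[j:j + k]].append(j)
    diagonal_hits = Counter()
    for i in range(len(query) - k + 1):
        for j in positions.get(query[i:i + k], []):
            diagonal_hits[i - j] += 1  # all hits on one diagonal share i - j
    return diagonal_hits.most_common(top)

# A shorter word size is used here only to keep the toy example small.
print(best_diagonals("ACGTACGT", "TTACGTACGT", k=4))
```

In this toy example the diagonal with offset -2 collects the most k-word matches, which corresponds to the query matching the reference after skipping its first two bases.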

The tool FASTA gives z-value and e-score to measure the significance of alignment.

2.2.2 Multiple Sequence Alignment

Progressive alignment and Iterative alignment are two popular techniques based on heuristic approach used for MSA.

2.2.2.1 Progressive alignment: Progressive alignment is the most popular heuristic for producing MSA [26]. It was developed by Feng and Doolittle [27]. In this technique, a scoring matrix is prepared based on similarity. Sequences are aligned in the order of their similarity. The steps involved in progressive alignment technique are discussed below:

⚫ Perform pairwise alignments using Needleman-Wunsch algorithm, Smith-Waterman algorithm, k-mer algorithm or k-tuple algorithm

⚫ Next, clustering of sequences is done using mBed or k-means algorithm [28]

⚫ Distance scores are obtained from similarity scores, and guide trees (or dendrograms) are constructed from the distance scores using Neighbour-Joining or the Unweighted Pair Group Method with Arithmetic Mean (UPGMA)

⚫ Based on the guide tree, most similar sequences are first aligned followed by less similar sequences and so on
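The guide tree construction step can be illustrated with a small UPGMA sketch. It takes an assumed symmetric distance matrix and returns the tree as nested tuples; the representation and the function name are choices made for this illustration, and branch lengths are not tracked.

```python
def upgma(names, dist):
    """Build a guide tree (nested tuples) from a symmetric distance matrix."""
    clusters = {i: (name, 1) for i, name in enumerate(names)}  # id -> (subtree, size)
    d = {(i, j): dist[i][j]
         for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)
    while len(clusters) > 1:
        a, b = min(d, key=d.get)  # closest pair of clusters
        (tree_a, size_a), (tree_b, size_b) = clusters.pop(a), clusters.pop(b)
        new_d = {}
        for c in clusters:
            dac = d[tuple(sorted((a, c)))]
            dbc = d[tuple(sorted((b, c)))]
            # Size-weighted average keeps the arithmetic mean over all members.
            new_d[(c, next_id)] = (size_a * dac + size_b * dbc) / (size_a + size_b)
        d = {pair: v for pair, v in d.items() if a not in pair and b not in pair}
        d.update(new_d)
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree

distances = [[0, 2, 6, 10],
             [2, 0, 6, 10],
             [6, 6, 0, 10],
             [10, 10, 10, 0]]
print(upgma(["A", "B", "C", "D"], distances))
```

Reading the nested tuples from the innermost pair outwards gives the order in which a progressive aligner would merge sequences: the most similar pair first, then progressively more distant ones.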

It generally takes O(N^2) time for a few thousand sequences of medium length, where N is the number of sequences [8]. This approach was proposed by a number of researchers [27], [29-33]. A number of tools are based on the progressive alignment approach, such as ClustalW [34], Clustal Omega [8], MAFFT [35], Kalign [13], Probalign [36], MUSCLE [37], DIALIGN [38], PRANK [39], FSA [40], T-Coffee [41], ProbCons (Notredame et al., 2000), MULTALIGN (Barton and Sternberg, 1987), MULTAL [42], MAP [43], PCMA [44], MUMMALS/PROMALS [45-46] and MSAProbs [47]. Progressive alignment produces fast, efficient and reasonable alignments. The major problem with the progressive alignment approach is that it considers only two sequences at a time while the rest of the sequences are ignored. This makes it a greedy approach, so optimal results cannot be guaranteed. The second problem is that the quality of the final result relies upon the initially selected pair of sequences; hence errors propagate throughout the alignment and cannot be fixed at later stages. This problem was overcome by Gotoh. The third problem concerns the selection of gap parameters [17].


➢ ClustalW

ClustalW [34] represents the third generation of the Clustal series of programs. It was developed by Thompson in 1994. The most popular methods for global MSA belong to this Clustal family. In 1988, the first Clustal program was proposed by Des Higgins. It was designed to be used on a personal computer with quite low computing power. It was based on DP and the progressive alignment approach. In 1992, another tool named ClustalV was developed, in which alignment of alignments, termed ‘profile alignment’, was done. The tree was then generated using the Neighbor-Joining (NJ) method [48]. The steps involved in working of ClustalW are discussed below:

⚫ Perform alignment between every pair of sequences using k-tuple method proposed by Wilbur and Lipman [49] or Needleman-Wunsch algorithm [50]

⚫ Convert the similarity scores of pairwise alignments to distance scores

⚫ Construct guide tree based on distance scores using Neighbour-Joining method

⚫ Finally MSA is obtained by progressively aligning the sequences in the order given by the guide tree beginning from tip of the tree going down to its root

ClustalW is found to have better quality, sensitivity and speed as compared to its counterparts. Sometimes, different weights are attached to gaps which causes problem of having different result in the end [17].

➢ MUSCLE

MUSCLE stands for MUltiple Sequence Comparison by Log-Expectation. It is based on progressive alignment approach. It performs Multiple sequence alignment. It has better accuracy and higher speed of alignment than ClustalW2 and T-Coffee. The steps involved in working of MUSCLE are discussed below [37]:

⚫ Guide tree is constructed using UPGMA method

⚫ Guide tree guides the progressive alignment to produce the initial MSA

⚫ Next, Kimura distance method is used to re-estimate the initial guide tree

⚫ Progressive alignment is done using second guide tree producing second MSA

⚫ If the SP score gets improved with second MSA then new alignment is kept otherwise it is discarded and first alignment is kept.
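The Kimura distance used to re-estimate the guide tree can be sketched as follows, assuming the commonly stated form of Kimura's correction, d = -ln(1 - D - D^2/5), where D is the fraction of differing positions between two aligned rows. This form was originally derived for protein sequences, and whether MUSCLE applies exactly this variant in every mode is an assumption here; the function names are hypothetical.

```python
import math

def fractional_identity(row_a, row_b):
    """Fraction of identical positions among columns where neither row has a gap."""
    pairs = [(a, b) for a, b in zip(row_a, row_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("rows share no ungapped columns")
    return sum(a == b for a, b in pairs) / len(pairs)

def kimura_distance(row_a, row_b):
    """Kimura-corrected distance between two rows of an existing alignment."""
    d = 1.0 - fractional_identity(row_a, row_b)
    inner = 1.0 - d - d * d / 5.0
    if inner <= 0:
        return math.inf  # correction is undefined for very divergent rows
    return -math.log(inner)

print(kimura_distance("ACGT", "ACTA"))
```

Unlike a k-mer based distance, this estimate requires the two rows to be aligned already, which is why it can only be applied after the first progressive alignment has been produced.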

Edgar came up with MUSCLE-fast, an improved version of the program MUSCLE which provides higher accuracy and speed. It can align 1000 sequences of average length of 282 in around 21 seconds on a personal computer [51].

➢ Clustal Omega

Clustal Omega has replaced the older Clustal-W tool. It is the latest tool from the Clustal suite of tools. Accuracy of Clustal Omega is similar to that of its other high-quality counterparts, whereas for a large number of sequences it performs better than its counterparts. It has lower execution time and high accuracy. Sievers et al. used Clustal Omega to align 190,000 sequences on a single processor in a few hours [8]. The steps involved in working of Clustal Omega are discussed below:

⚫ Pairwise alignments are produced using k-tuple method (same as used by older Clustal-W)


⚫ Sequences are clustered using mBed method which works by embedding each sequence in a space of ‘n’ dimensions where ‘n’ is proportional to logN. mBed has a complexity of O(NlogN)

⚫ K-means++ clustering [28] is used for clustering of sequences. This algorithm eliminates the problem of selecting initial cluster centers in k-means hence improving its speed and accuracy

⚫ UPGMA method is used to construct guide tree

⚫ HHalign package is used to produce final MSA [8]. HHalign was proposed by Johannes Soding in 2005 [52].
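The k-means++ seeding idea mentioned in the steps above can be sketched as follows. The points stand in for sequences that have already been embedded as vectors (as mBed would do); the coordinates, the function names and the fixed random seed are all assumptions for illustration, and the sketch assumes the input holds at least k distinct points.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_pp_seeds(points, k, seed=0):
    """Pick k initial centers, favouring points far from those already chosen."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]  # first center: uniform at random
    while len(centers) < k:
        # Weight each point by its squared distance to the nearest chosen center.
        weights = [min(squared_distance(p, c) for c in centers) for p in points]
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers

# Hypothetical 2-D embeddings of five sequences.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0), (20.0, 0.0)]
seeds = kmeans_pp_seeds(points, k=3)
print(seeds)
```

Because an already chosen point has weight zero, it is not drawn again, and distant groups of points are more likely to each receive an initial center than under uniform random seeding.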

➢ T-Coffee

T-Coffee [15] is based on the progressive alignment approach. It performs both pairwise and multiple sequence alignment. It stands for Tree-based Consistency Objective Function for Alignment Evaluation. It performs MSA using an iterative approach. Its major shortcoming is that it can align only up to 100 sequences without affecting accuracy. T-Coffee is found to have 5-10% better accuracy than Clustal-W. The steps involved in working of T-Coffee are discussed below:

⚫ Distance matrix is produced from pairwise alignments

⚫ Guide tree is formed using Neighbour-Joining method

⚫ Guide tree guides the grouping of sequences during MSA

⚫ DP is used to align two most similar sequences and process goes on until all sequences are aligned

Tommaso et al. (2011) developed a new interface for T-Coffee. It offers a standard T-Coffee mode for protein and nucleotide sequences, an M-Coffee mode which combines alignments produced by various methods, and a template based mode of T-Coffee. It is a parallelised algorithm which provides higher accuracy and can align up to 150 sequences [41].

➢ M-Coffee

The M-Coffee method combines the alignments produced by various methods. A distance matrix is computed where each value depicts the difference between two methods in terms of the number of bases aligned similarly in their results. Based on the distance matrix, the methods are arranged in the form of a tree. Then their outputs are combined as per the method tree. It is found to deliver better results than any single method [14].

➢ MultAlin

This method proposed by Corpet in 1988 is based on DP approach of pairwise sequence alignment. It performs multiple sequence alignment with hierarchical clustering. Steps involved in its working are as follows:

⚫ Firstly, closest sequences are aligned to form set of sequences and then all sequences in these sets are aligned.

⚫ Now, sequences are aligned in hierarchical order forming another matrix.


⚫ If this alignment is different from the one produced in step 1, process is iterated until they converge [17].

➢ DIALIGN

This tool is used for pairwise as well as multiple sequence alignment. It can identify similarity at the local level when sequences are not globally similar. It was introduced in 1996 at the University of Bielefeld. Its original version works with primary-sequence based information only, without requiring human input. Advanced versions of DIALIGN make use of expert knowledge. It can be used for protein as well as DNA sequences. Presently, the University of Gottingen is carrying out extensive research on DIALIGN [53]. Schmollinger et al. (2004) came up with the idea of executing DIALIGN on multiple processors to reduce the running time required for alignment. The running time of DIALIGN was reduced by up to 97% [54]. Subramanian et al. (2005) came up with DIALIGN-T, an improved version of DIALIGN [55].

2.2.2.2 Iterative alignment: Iterative approach is an improvement of progressive alignment. The underlying principle of this approach is same as that of progressive alignment with repeated application of DP to perform re-alignment of sequences so that overall alignment quality is improved. This also removes any errors introduced in initial alignment hence enhancing overall accuracy of alignment [56]. Iterative alignment approach is applicable to few hundred sequences only [8]. A number of algorithms and tools are available based on iterative approach such as PRRP [57], MUSCLE [37], Dialign [58], SAGA [59], MAFFT [35], PRIME [60] and T-Coffee [15].

➢ MAFFT

MAFFT stands for Multiple Alignment using Fast Fourier Transform. It is based on progressive and iterative alignment approach [35]. The various steps involved in working of MAFFT are discussed below:

⚫ Fast Fourier Transform is used to identify homologous regions

⚫ A simple scoring system is used to reduce the CPU time and improve the accuracy of alignment

MAFFT is based upon two-cycle heuristics viz. the progressive method (FFT-NS-2) and the iterative refinement method (FFT-NS-i). Firstly, FFT-NS-2 is used to calculate pairwise distances and then FFT-NS-i is used to refine the calculated distances. FFT-NS-i is more accurate than FFT-NS-2, whereas FFT-NS-2 performs faster alignments. The PartTree option can be used, which offers high scalability in alignment for up to 50,000 sequences [35].

➢ Kalign

Kalign is a global progressive alignment method based on a string matching algorithm called Wu-Manber. It is used to calculate distance between sequences. It introduces factor of local matches in the global alignment strategy. Distance between the two sequences is given by Levenshtein edit distance. There is a distance ‘d’ between two sequences P and Q if P can be changed to Q through transformations with application of ‘d’ number of mismatches, insertions or deletions. Following steps are involved in working of Kalign:

⚫ Firstly, pairwise distances are calculated using k-tuple method.

⚫ Next, guide tree is formed using UPGMA or Neighbor-Joining method.
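The Levenshtein edit distance referred to above can be computed with a short DP routine. This is the textbook formulation, shown only to make the definition of ‘d’ concrete; it is not the Wu-Manber approximate matching procedure that Kalign itself uses, and the function name is hypothetical.

```python
def levenshtein(p, q):
    """Minimum number of mismatches, insertions or deletions turning p into q."""
    previous = list(range(len(q) + 1))  # distances from the empty prefix of p
    for i, a in enumerate(p, start=1):
        current = [i]
        for j, b in enumerate(q, start=1):
            current.append(min(previous[j] + 1,              # delete a from p
                               current[j - 1] + 1,           # insert b into p
                               previous[j - 1] + (a != b)))  # match or mismatch
        previous = current
    return previous[-1]

print(levenshtein("ACGT", "AGT"))
```

Only two rows of the DP table are kept at a time, so memory grows with the length of the shorter dimension rather than with the full m×n matrix.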


Kalign provides higher speed and accuracy even for a large number of sequences. It is found to be faster than ClustalW [13].

2.3 Burrows Wheeler Aligner (BWA)

BWA [61] is based upon Burrows Wheeler Transform (BWT). It is used for compression and pattern matching. After performing BWT, a string comprising of last characters and an index are obtained which enable pattern matching. This principle is used in BWA to align the sequences. It is good for mapping less divergent sequences against a large reference genome such as human genome. It has four different algorithms:

⚫ BWA-backtrack: It is used for sequence reads up to 100 bp.

⚫ BWA-SW: It is used for sequence reads ranging from 70 bp up to 1 Mbp.

⚫ BWA-MEM: It is used for sequence reads ranging from 70 bp up to 1 Mbp.

⚫ BWA aln/SAMSE/SAMPE

2.4 Bowtie

There are many new algorithms and tools coming up for short sequence reads. Bowtie [62], Maq [63] and SOAP [64] are a few prominent tools in this category. Bowtie produces very fast alignments and requires less memory space. Bowtie 2 is an improved version of Bowtie. Bowtie 2 is available as a part of another tool named Codon Code Aligner, which is a GUI tool. Codon Code Aligner performs sequence alignment, assembly and mutation detection.

2.5 Hidden Markov Model Approach

ProbCons [65] is a method based on Hidden Markov Model, a statistical model [18]. It is a progressive alignment method which combines probabilistic modeling and consistency based alignment. MUMMALS and PROMALS are extended form of ProbCons.

2.6 Natural Computing Approach

Methods based on heuristic approach are fast but they do not provide optimal solutions. Natural computing algorithms are gaining importance as they provide optimal or near-optimal solution. A number of hybrid techniques have been proposed by various researchers based upon the natural computing algorithms [66-69]. These are also called stochastic methods. They aim at solving complex and poorly defined optimization problems. Genetic Algorithm (GA), Particle Swarm Optimization (PSO), Ant Colony Optimization (ACO) and Artificial Bee Colony are the prominent names in this category. A number of genetic algorithms have been proposed as discussed by Chowdhury and Garai in their study [18]. A new algorithm has been developed based on Flower Pollination Algorithm (FPA) termed as Pairwise Sequence Alignment with Flower Pollination Algorithm (PSAFPA). It is inspired from the pollination of flowering plant species [70].

Table 2. List of a few Sequence Alignment Algorithms with their URLs

Algorithm/Tool                        URL
ClustalW [34]                         http://www.ebi.ac.uk/clustalw/
T-Coffee [15]                         www.ebi.ac.uk/Tools/msa/tcoffee/
DiAlign-T [55]                        http://dialign-t.gobics.de/
MAFFT [35]                            www.ebi.ac.uk/Tools/msa/mafft/
BLAST [22]                            .ncbi.nlm.nih.gov
BLAT [20]                             http://genome.ucsc.edu/cgi-bin/hgBlat
FASTA [24-25]                         ncbi.nlm.nih.gov
Clustal Omega [8]                     www.ebi.ac.uk/Tools/msa/clustalo/
Muscle [37]                           www.ebi.ac.uk/Tools/msa/muscle/
PRIME [60]                            http://prime.cbrc.jp/
Burrows Wheeler Aligner (BWA) [61]    http://bio-bwa.sourceforge.net/
Bowtie2 [62]                          https://www.codoncode.com/aligner/

3. Performance Evaluation of Sequence Alignment tools

3.1 Evaluation Metrics

As sequence alignment acts as a very important pre-processing step for further high-level tasks such as identification of genetic variations, it becomes highly important to produce alignments with high accuracy, high speed, good quality and low computational complexity [5], [51], [71]. The first aspect of accuracy is how well the produced alignment adheres to the true alignment of the sequences. The second aspect is whether the insertions, deletions and gaps are indicated at the right positions. Computational complexity includes time, memory and CPU requirements. The complexity of an MSA tool is expressed as O(L^n), where L is the length of the sequence and ‘n’ is the number of sequences to be aligned [19]. Identity score and Identity percentage (the identity score expressed as a percentage) are quite suitable measures of alignment quality. Identity score reflects the number of matches found against the total number of compared nucleotide bases. The higher the identity score, the higher the similarity between the compared sequences, and hence the better the alignment quality of the algorithm (or tool). Identity percentage for unequal sequences is measured by the following formula:

%Identity= {(2*S)/(L1+L2)} *100

Where L1 is Length of seq1

L2 is Length of seq2

S is No. of matches


For equal sequences, Identity percentage is given by the formula:

%Identity= (No.of matches/Total length of sequence)*100
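Both identity formulas can be expressed as a short sketch. It assumes the two input strings are rows of an existing pairwise alignment in which ‘-’ marks a gap, and that L1 and L2 are the ungapped sequence lengths; the function names are hypothetical and this is not the code used to produce the results reported in this study.

```python
def count_matches(aligned_a, aligned_b):
    """Number of columns where both rows carry the same base (gaps never match)."""
    return sum(1 for a, b in zip(aligned_a, aligned_b) if a == b and a != "-")

def identity_unequal(aligned_a, aligned_b):
    """%Identity = (2 * S) / (L1 + L2) * 100 for sequences of unequal length."""
    s = count_matches(aligned_a, aligned_b)
    l1 = len(aligned_a.replace("-", ""))
    l2 = len(aligned_b.replace("-", ""))
    return 2 * s / (l1 + l2) * 100

def identity_equal(seq_a, seq_b):
    """%Identity = matches / length * 100 for sequences of equal length."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal length")
    return count_matches(seq_a, seq_b) / len(seq_a) * 100

print(identity_unequal("ACGT", "A-GT"))  # S = 3, L1 = 4, L2 = 3
print(identity_equal("ACGT", "ACGA"))    # 3 matches out of 4
```

For the first call the formula gives (2 × 3) / (4 + 3) × 100, roughly 85.7%, while the second call yields 75%.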

One more way to evaluate an alignment is to compare it to an accurate reference alignment. This reference alignment is taken as true and is termed a test case. Julie D. Thompson suggested this evaluation system. It is found that the reference alignment, i.e. the test case selected for evaluation, affects the evaluation process, and that not all algorithms handle the problems in a given test case in the same manner. Lambert et al. (2003) suggest in their study that users should select the alignment method depending upon the sensitivity and selectivity required for the set of sequences to be aligned. Results are more reliable when a consensus can be built from the results of different methods.

3.2 Results and Discussion

In this study, various tools and algorithms are studied and implemented, and their performance is evaluated and compared. DNA sequences are retrieved from the National Center for Biotechnology Information (NCBI), available at https://www.ncbi.nlm.nih.gov/. Identity Percentage is used as the measure of alignment quality to evaluate and compare the discussed algorithms and tools. Results of sequence alignment for three pairs of sequences are presented in Table 3.

Table 3. Percentage Identity values for various Sequence Alignment Algorithms

Algorithm/Tool                     Percentage Identity
                                   Seq1 & Seq2   Seq3 & Seq4   Seq5 & Seq6
Clustal Omega Emboss Needle        47.8%         62.3%         35.6%
Clustal Omega Emboss Stretcher     46.2%         64.9%         48.6%
Clustal Omega Emboss Water         70.8%         78.7%         55.6%
Clustal Omega Emboss Matcher       100%          78.7%         84.4%
MUSCLE                             44.19%        62.67%        54.05%
T-Coffee (using M-Coffee)          38.46%        56.75%        50.64%
T-Coffee (ebi.ac.uk)               43.59%        83%           50.67%
MAFFT                              41.02%        59.45%        50.67%
Blast (Needleman-Wunsch)           51%           65%           52%
Clustal W                          43.59%        -             50.67%
PSAFPA                             38.35%        45.11%        48.64%

From the results, it is observed that for pairwise sequence alignment, Clustal Omega Emboss Matcher outperforms other tools and algorithms, followed by Clustal Omega Emboss Water, further followed by Blast (Needleman-Wunsch algorithm for global alignment). Apart from these algorithms, Burrows Wheeler Aligner (BWA) and Bowtie2 (as part of Codon Code Aligner) are also implemented as part of this study.

4. Conclusions

The biggest challenge in sequence alignment is to produce alignment results with good quality, high accuracy, high speed and low computational complexity. Multiple Sequence Alignment (MSA) is a computationally intense task whose complexity is many times higher than that of pairwise sequence alignment. Speed of alignment and computational complexity are adversely affected when the length of the sequences increases. Heuristic methods provide a feasible alignment faster than the Dynamic Programming approach, which has high time complexity. Manual refinement of alignment results continues to dominate, as fully automated algorithms are not able to meet the demand. Big data technologies, parallelism and cloud computing offer a promising solution for alignment of a large number of sequences in a shorter amount of time. It is observed from the evaluation of methods that no single method performs well for every test sequence. Some methods are good for long sequences, whereas others provide a high quality solution for short sequences. A hybrid method combining the features of several state-of-the-art methods can be a good solution.

References

1. X. Xia, Editor, “Bioinformatics and the Cell: Modern Computational Approaches in Genomics: Proteomics and Transcriptomics”, vol. XVI. 2. E.M. Mohamed, H.M. Mousa and A.E. Keshk, “Comparative Analysis of Multiple Sequence Alignment Tools”, I.J. Information Technology and Computer Science, vol. 8, (2018), pp. 24-30. 3. R. Sachidanandam, D. Weissman, S.C. Schmidt, J.M. Kakol, L.D. Stein, G. Marth, S. Sherry, J.C. Mullikin, B.J. Mortimore, D.L. Willey, S.E. Hunt, C.G. Cole, P.C. Coggill, Z. Ning, J. Rogers, D.R. Bentley, P.Y. Kwok, E.R. Mardis, R.T. Yeh, B. Schultz, L.Cook, R. Davenport, M.Dante, L. Fulton, L. Hillier, R.H. Waterston, J.D. McPherson, B. Gilman, S. Schaffner, W.J. Van Etten, D. Reich, J. Higgins, M.J.Daly, B. Blumenstiel, J. Baldwin, N. Stange-Thomann, M.C. Zody, L. Linton, E.S. Lander, D. Altshuler and International SNP Map Working Group, “A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms”, Nature, vol. 409, (2001), pp. 928-933. 4. F.S. Collins, L.D. Brooks and A. Chakravarti, “A DNA polymorphism discovery resource for research on human genetic variation”, Genome Research, vol.8, (1998), pp. 1229–1231. 5. C. Kemena and C. Notredame, “Upcoming challenges for multiple sequence alignment methods in the high-throughput era”, Bioinformatics, vol.25, no.19, (2009), pp. 2455-2465. 6. R.C. Edgar and S. Batzoglou, “Multiple Sequence Alignment”, Current Opinion in Structural Biology, vol. 16, no.3, (2006), pp. 368-373. 7. C. Notredame, “Recent Evolutions of Multiple Sequence Alignment Algorithms”, PLoS , vol. 3, no.8, (2007). 8. F. Sievers, A. Wilm, D. Dineen,T.J. Gibson, K. Karplus, W. Li, R. Lopez, H. McWilliam, M. Remmert, J. Soding, J.D. Thompson and D.G. Higgins, “Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega”, Molecular Systems Biology,vol. 7, (2011) . 9. M. Chatzou, C. Magis, J.M. Chang, , C. Kemena, G. Bussotti, I. Erb and C. 
Notredame, “Multiple sequence alignment modeling: methods and applications”, Briefings in Bioinformatics, vol. 17, no. 6, (2015), pp. 1009–1023. 10. D.J. Russell, Editor, “Multiple Sequence Alignment Methods”, Humana Press, New York: Humana Press, (2014). 11. C. Notredame, “Recent progress in multiple sequence alignment: a survey”, Pharmacogenomics, vol. 3, no. 1, (2002), pp. 131-144. 12. H. Li and N. Homer, “A survey of sequence alignment algorithms for next-generation Sequencing”, Briefings in Bioinformatics, vol. 11, no. 5, (2010), pp. 473-483. 13. T. Lassmann and E.L.L. Sonnhammer, “Kalign – an accurate and fast multiple sequence alignment algorithm”, BMC Bioinformatics, vol. 6, (2005), pp. 298. 14. I.M. Wallace, O. O’Sullivan, D.G. Higgins and C. Notredame, “M-Coffee: combining multiple sequence alignment methods with T-Coffee”, Nucleic Acids Research, vol. 34, no. 6, (2006), pp. 1692–1699. 15. C. Notredame, D.G. Higgins and J. Heringa, “T-Coffee: A Novel Method for Fast and Accurate Multiple Sequence Alignment”, Journal of Molecular Biology, vol. 302, (2000), pp. 205-217.

ISSN: 2005-4238 IJAST Copyright ⓒ 2020 SERSC 11262 International Journal of Advanced Science and Technology Vol. 29, No. 3, (2020), pp. 11251 - 11265

16. J.D. Thompson, B. Linard, O. Lecompte and O. Poch, “A Comprehensive Benchmark Study of Multiple Sequence Alignment Methods: Current Challenges and Future Perspectives”, PLoS ONE, vol. 6, no. 3, (2011). 17. C. Lambert, J.M.V. Campenhout, X. DeBolle and E. Depiereux, “Review of Common Sequence Alignment Methods: Clues to Enhance Reliability”, Current Genomics, vol. 4, (2003), pp. 131-146. 18. B. Chowdhury and G. Garai, “A review on multiple sequence alignment from the perspective of genetic algorithm”, Genomics, vol. 109, (2017), pp. 419-431. 19. J. Daugelaite, A.O. Driscoll and R.D. Sleator, “An Overview of Multiple Sequence Alignments and Cloud Computing in Bioinformatics”, ISRN Biomathematics, vol. 2013, (2013). 20. W.J. Kent, “BLAT-The BLAST-Like Alignment Tool”, Genome Research, vol. 12, (2002), pp. 656-664. 21. M. Bhagwat, L. Young and R.R. Robison, “Using BLAT to Find Sequence Similarity in Closely Related Genomes”, Current Protocols in Bioinformatics, (2012), pp. 10.8.1-10.8.24. 22. S.F. Altschul, W. Gish, W. Miller, E.W. Myers and D.J. Lipman, “Basic alignment Search tools”, Journal of Molecular Biology, vol. 215, (1990), pp. 403-410. 23. E.S. Donkor, N.T.K.D. Dayie and T.K. Adiku, “Bioinformatics with basic local alignment search tool (BLAST) and fast alignment (FASTA)”, Journal of Bioinformatics and Sequence Analysis, vol. 6, no. 1, (2014), pp. 1-6. 24. D.J. Lipman and W.R. Pearson, “Rapid and sensitive protein similarity searches”, Science, vol. 227, (1985), pp. 1435-1441. 25. W.R. Pearson, “Rapid and sensitive sequence comparison with FASTP and FASTA”, Methods Enzymology, vol. 183, (1990), pp. 63-98. 26. G.J. Barton and M.J. Sternberg, “A strategy for the rapid multiple alignment of protein sequences. Confidence levels from tertiary structure comparisons”, Journal of Molecular Biology, vol. 198, (1987), pp. 327-337. 27. D.F. Feng and R.F. 
Doolittle, “Progressive sequence alignment as a prerequisite to correct phylogenetic trees”, Journal of Molecular Evolution, vol. 25, (1987), pp. 351-360. 28. D. Arthur and S. Vassilvitskii, “k-means++: the advantages of careful seeding”, Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms, Society for Industrial and Applied Mathematics. 29. Main Page-KVM, 2013, http://www.linux-kvm.org/page/ 30. P. Hogeweg and B. Hesper, “The alignment of sets of sequences and the construction of phyletic trees : an integrated method”, Journal of Molecular Evolution, vol. 20, (1984), pp. 175-186. 31. M.S. Waterman and M.D. Perlwitz, “Line geometries for sequence comparisons”, Bulletin of Mathematical Biology, vol.46, (1984), pp. 567-577. 32. W.R. Taylor, “Multiple sequence alignment by a pairwise algorithm”, Computer Applications in the Biosciences, vol. 3, (1987), pp. 81-87. 33. D.G. Higgins and P.M.Sharp, “Fast and sensitive multiple sequence alignments on a microcomputer”, Computer Applications in the Biosciences, vol. 8, (1989), pp. 189-191. 34. J.D. Thompson, D.G. Higgins and T.J. Gibson, “CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice”, Nucleic Acids Research, vol. 22, no. 22, (1989), pp. 4673-4680. 35. K. Katoh and D.M. Standley, “MAFFT multiple sequence alignment software version 7: improvements in performance and usability”, Molecular Biology and Evolution, vol. 30, no. 4, (2013), pp. 772-780. 36. U. Roshan and D.R. Livesay, “Probalign: multiple sequence alignment using partition function posterior probabilities”, Bioinformatics, vol. 22, no. 22, (2006), pp. 2715–2721. 37. R.C. Edgar, “MUSCLE: a multiple sequence alignment method with reduced time and space complexity”, BMC Bioinformatics, vol. 5, (2004). 38. B. Morgenstern, “DIALIGN: multiple DNA and protein sequence alignment at BiBiServ”, Nucleic Acids Research, vol. 32, no. 
2, (2004), pp.W33–W36. 39. A. Loytynoja and N. Goldman, “Phylogeny-aware gap placement prevents errors in sequence alignment and evolutionary analysis”, Science, vol. 320, no. 5883, (2008), pp. 1632–1635.

40. R.K. Bradley, A. Roberts, M. Smoot, S. Juvekar, J. Do, C. Dewey, I. Holmes and L. Pachter, “Fast statistical Alignment”, PLoS Computational Biology, vol. 5, no. 5, (2009). 41. P. Di Tommaso, S. Moretti, I. Xenarios, M. Orobitg, A. Montanyola, J.M. Chang, J.F. Taly and C. Notredame, “T-Coffee: a web server for the multiple sequence alignment of protein and RNA sequences using structural information and homology extension”, Nucleic Acids Research, vol. 39, no. 2, (2011), pp. W13–W17.

ISSN: 2005-4238 IJAST Copyright ⓒ 2020 SERSC 11263 International Journal of Advanced Science and Technology Vol. 29, No. 3, (2020), pp. 11251 - 11265

42. W.R. Taylor, “Multiple sequence alignment by a pairwise algorithm”, Computer Applications in the Biosciences, vol. 3, (1987), pp. 81-87. 43. X. Huang, “On global sequence alignment”, Computer Applications in the Biosciences, vol. 10, (1994), pp. 227-235. 44. J. Pei, R. Sadreyev and N.V.Grishin, “PCMA: fast and accurate multiple sequence alignment based on profile consistency”, Bioinformatics, vol. 19, (2003), pp. 427-428. 45. J. Pei and N.V. Grishin, “MUMMALS: multiple sequence alignment improved by using hidden Markov models with local structural information”, Nucleic Acids Research, vol. 34, (2006), pp. 4364-4374. 46. J. Pei and N.V.Grishin, “PROMALS: towards accurate multiple sequence alignments of distantly related protein”, Bioinformatics, vol. 23, (2006), pp. 802-808. 47. Y. Liu, B. Schmidt and D.L. Maskell, “MSAProbs: multiple sequence alignment based on pair hidden Markov models and partition function posterior probabilities”, Bioinformatics, vol. 26, no. 16, (2010), pp. 1958-1964. 48. R. Chenna, H. Sugawara, Koike, T.J. Gibson, D.G. Higgins and J.D. Thompson, “Multiple sequence alignment with the Clustal series of programs”, Nucleic Acids Research, vol. 31, no. 13, (2003), pp. 3497-3500. 49. W.J. Wilbur and D.J. Lipman, “Rapid similarity searches of nucleic acid and protein data banks”, in the Proceedings of the National Academy of Sciences of the United States of America, vol. 80, no.3, (1983), pp. 726-730. 50. S.B. Needleman and C.D. Wunsch, “A general method applicable to the search for similarities in the amino acid sequence of two proteins”, Journal of Molecular Biology, vol. 48, no. 3, (1970), pp. 443-453. 51. R.C. Edgar and S. Batzoglou, “Multiple Sequence Alignment”, Current Opinion in Structural Biology, vol. 16, no.3, (2006), pp. 368-373. 52. J. Soding, “Protein homology detection by HMM-HMM comparison”, SOLiDTM4 System (2013), (2005). 53. L.A. Ait, Z. Yamak and B. 
Morgenstern, “DIALIGN at GOBICS-multiple sequence alignment using various sources of external information”, Nucleic Acids Research, vol. 41, (2013), pp. W3-W7. 54. M. Schmollinger, K.Nieselt, M. Kaufmann and B. Morgenstern, “DIALIGN P: Fast pair-wise and multiple sequence alignment using parallel processors”, BMC Bioinformatics, vol.5, (2004), pp.128. 55. A.R. Subramanian, J. Weyer-Menkhoff, M. Kaufmann and B. Morgenstern, “DIALIGN-T: An improved algorithm for segment-based multiple sequence alignment”, BMC Bioinformatics, vol. 6, (2005), pp. 66. 56. D.W. Mount, “Using iterative methods for global multiple sequence alignment”, Cold Spring Harbor Protocols, vol. 4, no. 7, (2009). 57. O. Gotoh, “Optimal alignment between groups of sequences and its application to multiple sequence alignment”, Computer Applications in the Biosciences, vol. 9, no. 3, (1993), pp. 361-370. 58. B. Morgenstern, “DIALIGN: multiple DNA and protein sequence alignment at BiBiServ”, Nucleic Acids Research, vol. 32, no. 2, (2004), pp.W33–W36. 59. C. Notredame and D.G. Higgins, “SAGA: sequence alignment by genetic algorithm”, Nucleic Acids Research, vol. 24, no. 8, (1996), pp. 1515-1524. 60. S. Yamada, O. Gotoh and H. Yamana, “Improvement in accuracy of multiple sequence alignment using novel group-to-group sequence alignment algorithm with piecewise linear gap cost”, BMC Bioinformatics, vol. 7, (2006), pp. 524. 61. H. Li and R. Durbin, “Fast and accurate short read alignment with Burrows-Wheeler Transform”, Bioinformatics, vol. 25, (2009), pp. 1754-1760. 62. B. Langmead, C.Trapnell, M. Pop and S.L. Salzberg, “Ultrafast and memory-efficient alignment of short DNA sequences to the human genome”, Genome Biology, vol. 10, no. 3, (2009). 63. H. Li, J. Ruan and R. Durbin, “Mapping short DNA sequencing reads and calling variants using mapping quality scores”,Genome Research, vol. 18, no. 11, (2008), pp. 1851–1858. 64. R. Li,Y. Li, K. Kristiansen and J. 
Wang, “SOAP: short oligonucleotide alignment program”, Bioinformatics, vol. 24, no. 5, (2008), pp. 713-714. 65. C.B. Do, M.S.P. Mahabhashyam, M. Brudno and S. Batzoglou, “ProbCons: probabilistic consistency-based multiple sequence alignment”, Genome Research, vol. 15, no. 2, (2005), pp. 330-340. 66. S.R. Jangam and N. Chakraborti, “A novel method for alignment of two nucleic acid sequences using ant colony optimization and genetic algorithm”, Applied Soft Computing, vol. 7, no. 3, (2007), pp.1121-1130. 67. C. Gondro and B.P. Kinghorn, “A simple genetic algorithm for multiple sequence alignment”, Genetics and Molecular Research, vol. 6, no. 4, (2007), pp.964-982.

ISSN: 2005-4238 IJAST Copyright ⓒ 2020 SERSC 11264 International Journal of Advanced Science and Technology Vol. 29, No. 3, (2020), pp. 11251 - 11265

68. Z.J. Lee, A.F. Su, C.C. Chuang and K.H. Liu, “Genetic algorithm with ant colony optimization (GA-ACO) for multiple sequence alignment”, Applied Soft Computing, vol. 8, no.1, (2008), pp.55-78. 69. G. Garai and B. Chowdhury, “A cascaded pairwise biomolecular sequence alignment technique using evolutionary algorithm”, Information Sciences, vol. 297, (2015), pp.118-139. 70. Y. Kaur and N. Sohi, “Pairwise Sequence Alignment Method Using Flower Pollination Algorithm”, 4th IEEE International Conference on Signal Processing, Computing and Control, Sept 21-23, 2017, Solan, India.(2017), pp. 408-413. 71. C. Notredame, “Recent Evolutions of Multiple Sequence Alignment Algorithms”, PLoS Computational Biology, vol. 3, no.8, (2007).

ISSN: 2005-4238 IJAST Copyright ⓒ 2020 SERSC 11265