A Comprehensive Review and Performance Evaluation of Sequence Alignment Algorithms for DNA Sequences

International Journal of Advanced Science and Technology, Vol. 29, No. 3 (2020), pp. 11251-11265

Neelofar Sohi (1) and Amardeep Singh (2)
(1) Assistant Professor, Department of Computer Science & Engineering, Punjabi University Patiala, Punjab, India
(2) Professor, Department of Computer Science & Engineering, Punjabi University Patiala, Punjab, India

Abstract

Background: Sequence alignment is a very important step for high-level sequence analysis applications in biocomputing. Alignment of DNA sequences helps in finding the origin of sequences, finding homology between sequences, constructing phylogenetic trees that depict evolutionary relationships, and other tasks. It also helps in identifying genetic variations in DNA sequences which might lead to diseases.
Objectives: This paper presents a comprehensive review of sequence alignment approaches, methods and various state-of-the-art algorithms. The performance of a few algorithms and tools is evaluated and compared.
Methods and Results: In this study, various tools and algorithms are studied and implemented, and their performance is evaluated and compared using Identity Percentage as the main metric.
Conclusion: It is observed that for pairwise sequence alignment, Clustal Omega Emboss Matcher outperforms the other tools and algorithms, followed by Clustal Omega Emboss Water, further followed by Blast (Needleman-Wunsch algorithm for global alignment).

Keywords: sequence alignment; DNA sequences; progressive alignment; iterative alignment; natural computing approach; identity percentage.

1. Brief History

Sequence alignment is a very important area in the field of Biocomputing and Bioinformatics. It aims to identify the regions of similarity between two or more sequences. Alignment is also termed ‘mapping’; it is done to identify and compare the nucleotide bases, i.e.
A, G, C and T in DNA sequences (nucleotide bases for DNA or RNA sequences, and amino acids for proteins). Sequence alignment is an important step in solving problems such as finding homology, determining the origin of a sequence, predicting protein structure, identifying new members of a protein family, constructing evolutionary (phylogenetic) trees, identifying mutations, and further higher-level sequence analyses [1-2].

Sequence analyses provide evidence that 99% of the genome sequences of different individuals are identical [3]; the remaining 1% difference is due to genetic variations, which can lead to human inherited diseases. Single Nucleotide Polymorphisms (SNPs), insertions/deletions, block substitutions, inversions, variable number of tandem repeat sequences (VNTRs) and copy number variations (CNVs) are a few common genetic variations [4]. Sequence alignment is a pre-processing step for the detection and identification of these genetic variations.

ISSN: 2005-4238 IJAST, Copyright ⓒ 2020 SERSC

There are two categories of sequence alignment: pairwise sequence alignment (PSA) and multiple sequence alignment (MSA). Pairwise sequence alignment compares two sequences to identify their matching regions. MSA compares more than two sequences, where there can be ‘n’ query sequences to be compared against ‘n’ reference sequences. Before the sequences are aligned, sequences of unequal length need to be made equal in length by inserting gaps. A large number of gap insertion algorithms are available. The optimality of a gap insertion algorithm relies on maximising the number of matches.
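As a rough illustration of gap insertion that maximises matches, the following Python sketch aligns two sequences of unequal length with a simple dynamic-programming scheme (the approach discussed in Section 2.1) and then computes an identity percentage. The function names, the scoring values (match = 1, mismatch = -1, gap = -1) and the definition of identity percentage as matches divided by alignment length are assumptions made for this example; they are not taken from the tools evaluated in this paper.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch-style global alignment (illustrative scoring values)."""
    m, n = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        score[i][0] = i * gap
    for j in range(1, n + 1):
        score[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a

    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = m, n
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[m][n]


def identity_percentage(x, y):
    """Assumed definition: identical non-gap columns / alignment length * 100."""
    if len(x) != len(y):
        raise ValueError("aligned sequences must have equal length")
    matches = sum(1 for p, q in zip(x, y) if p == q and p != "-")
    return 100.0 * matches / len(x)


aligned_a, aligned_b, total = global_align("ACGT", "AGT")
print(aligned_a)   # ACGT
print(aligned_b)   # A-GT
print(identity_percentage(aligned_a, aligned_b))  # 75.0
```

In this small case a single gap makes the two sequences equal in length and yields three matching columns out of four. Other tools may define identity differently (for example, relative to the shorter sequence), so the percentage here is only indicative.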
Next-generation sequencing technologies such as Roche/454 (454 Life Sciences, 2013), Illumina (Illumina, 2013) and SOLiD (SOLiD™ 4 System, 2013), and large-scale projects such as the Human Genome Project (Human Genome Project Information, 2013), the 1000 Genomes Project (Home, 1000 Genomes, 2013) and the Genome 10K project (G.K.C.O Scientists, 2009), are generating large volumes of sequence data. Aligning many sequences of great length is therefore a major challenge. Because many high-level applications depend on sequence alignment, it is highly important to produce alignments with high accuracy, high speed, good quality and low computational complexity [5-7].

For early MSA tools, the time complexity is O(L^n), where L is the sequence length and n is the number of sequences. Earlier, L used to be large and n used to be smaller than L, whereas now n, the number of sequences, has become larger than L. Clustal Omega is the only MSA algorithm which can handle up to 190,000 sequences, aligning them in a few hours on a single processor. In a 2013 study, Sievers et al. compared 18 standard automated MSA tools and packages with respect to scalability. The study concluded that tools like PSAlign, Prank, FSA and Mummals can align up to 100 sequences. Tools like Probcons, MUSCLE, MAFFT, ClustalW and MSAProbs can align up to 1000 sequences. Tools such as Clustal Omega, Kalign and Part-Tree can align up to 50,000 sequences. Some MSA methods produce high-quality output but do not scale well to thousands of sequences, whereas a few provide good scalability but produce poor-quality output [8].

Various researchers have reviewed pairwise and multiple sequence alignment methods and approaches [9-12]. Alignment methods have been compared and evaluated in certain studies [8], [13-16].

There are two types of sequence alignment: local and global alignment.
Local alignment finds local regions of high similarity between two sequences. It is best suited to quite divergent sequences that share local regions of high similarity. The global alignment strategy performs end-to-end alignment by comparing the full lengths of two sequences against each other. Local alignment methods that produce ungapped alignments avoid the problem of fixing a gap penalty. Global alignments provide information for evolutionary comparisons, and local alignments are useful for structural predictions [17].

The paper is structured as follows: Section 2 describes the major approaches, methods and state-of-the-art algorithms and tools used for sequence alignment of DNA sequences. Section 3 presents the performance evaluation, including performance metrics and evaluation results for various sequence alignment algorithms and tools. Section 4 presents the conclusion drawn from the study.

2. Sequence Alignment Approaches

2.1 Dynamic Programming

Dynamic Programming (DP) is the approach that produces an optimal alignment. The Needleman-Wunsch algorithm is a DP-based technique which produces a global alignment, and the Smith-Waterman algorithm is another DP-based technique which produces a local alignment. To obtain an MSA, we try to maximize the Sum of Pairs score obtained from the pairwise alignments of the sequences. There is no universally accepted objective function for MSA using the DP approach. The time complexity of DP is O(L^n), where L is the sequence length and n is the number of sequences; for a pair of sequences (n = 2) this is O(L^2). DP produces an optimal alignment for a pair of sequences, but its time complexity increases for multiple sequences. DP involves the following steps [17]:
⚫ Every nucleotide in one sequence is compared to each and every nucleotide of the second sequence.
⚫ The results of this comparison are marked and stored in an m × n matrix, where m and n are the lengths of the two sequences.
⚫ All paths in the matrix are searched to find the optimal alignment with the highest score.

For the pairwise sequence alignment problem, DP provides the optimum alignment for a given objective function through a trace-back procedure, whereas for MSA this trace-back procedure takes exponential time [18]. MSA is an NP-complete problem in which the aim is to identify the MSA with the maximum score among the set of found alignments. Hence, for MSA, more sophisticated heuristic methods are required [19]. Agarwal et al. (2005) proposed a more efficient version of DP which produces an optimal alignment. Bayat et al. (2019) proposed a DP-based method which produces a semi-global alignment in which a few of the first and last bases of the compared sequences can be skipped. This method extracts ‘Maximal Exact Matches’ (MEMs) from the compared sequences using shift and compare operations on the two sequences. It is suitable where the number of MEMs is lower than the total number of bases (or nucleotides) in the sequences.

2.2 Heuristic Approach

Heuristic methods do not guarantee optimal solutions, but they provide a feasible solution in a short amount of time, in contrast to DP-based approaches, i.e. exact alignment approaches, which provide high-quality, optimal results [18].

2.2.1 Pairwise sequence alignment

For pairwise sequence alignment, BLAST, BLAT and FASTA are the popular tools based on the heuristic approach; they produce a solution in a short amount of time.

2.2.1.1 BLAT: BLAT [20] stands for BLAST-Like Alignment Tool. BLAT was written by Jim Kent. Its working principle is similar to that of BLAST, with the improvement that it stores an index of the reference sequence in memory rather than the full sequence, leading to a low memory requirement and enhanced alignment speed. The index is used to find areas of homology, which can then be loaded into memory for a detailed alignment.
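The index-then-align idea can be sketched in a few lines of Python. This is a simplified illustration, not BLAT's actual implementation: the function names, the use of overlapping k-mers from the reference, and the word size k = 3 are all assumptions chosen to keep the example small.

```python
from collections import Counter, defaultdict


def build_kmer_index(reference, k):
    """Map each k-mer of the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def find_seed_hits(index, query, k):
    """Return (query_pos, reference_pos) pairs for every shared k-mer."""
    hits = []
    for qpos in range(len(query) - k + 1):
        for rpos in index.get(query[qpos:qpos + k], []):
            hits.append((qpos, rpos))
    return hits


def best_diagonal(hits):
    """Pick the diagonal (reference_pos - query_pos) with the most seed hits.

    Several hits on one diagonal suggest a candidate homologous region that
    a detailed alignment step could then examine.
    """
    if not hits:
        return None
    counts = Counter(rpos - qpos for qpos, rpos in hits)
    return counts.most_common(1)[0]  # (diagonal, number_of_hits)


reference = "ACGTACGTGA"   # hypothetical reference sequence
query = "TACG"             # hypothetical query sequence
index = build_kmer_index(reference, 3)
hits = find_seed_hits(index, query, 3)
print(hits)                 # [(0, 3), (1, 0), (1, 4)]
print(best_diagonal(hits))  # (3, 2)
```

Here two seed hits fall on diagonal 3, which points to reference positions 3 to 6 as the region worth aligning in detail. Only the index lookups touch the whole reference, which is the source of the speed-up described above.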
BLAT is for finding sequences with
